Today I want to talk about one cool little thing about Rust that I learned recently.
Rust has a widely used `Option` type with great language support, and using it prevents a whole class of errors commonly encountered in C++. As a bare-metal, cycle-counting aficionado I see these cool features and can’t help but wonder: “Sure, but what does it cost?”
It turns out Rust also implements a “null pointer optimization” which allows it to represent an `Option` of pointer types exactly like you would in C++–as a simple pointer, but with the benefit that the programmer is forced to check it properly. It doesn’t even stop there! Other types with “invalid” encodings can be optimized down to smaller representations, too.

So, how about a common use case for systems with limited communication bandwidth: integer values with a not-a-number canary? It turns out we can get the same optimization to fire!
## Background
There’s a (surprisingly common) niche of system design where you care highly about accuracy but don’t have a ton of bandwidth available–think devices with 12-bit ADCs hooked up via a CAN bus, for example. If the signal you’re measuring is sometimes unavailable (maybe it can only be measured in certain modes), how do you represent it on the wire?

Obviously a 12-bit integer works well to cover the range of measured values, but what should you send if the signal isn’t available at all? You have two options: out-of-band indication, or in-band indication. If you choose out-of-band, you’d indicate ‘mode’ / ‘validity’ somewhere else in the message, and then populate the field with some filler–usually just zero, or maybe the last-measured value. This “costs” an extra bit if you weren’t already planning to send the mode along, and moreover, if code blindly reads the value without checking the mode first, it can pick up a value that’s stale or incorrect.
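A minimal sketch of the out-of-band scheme (the names here are illustrative, not from any real protocol), showing exactly where the hazard lives:

```rust
/// Hypothetical decoded frame with an explicit validity flag ("out-of-band").
pub struct Frame {
    pub valid: bool, // carried somewhere else in the message
    pub value: u16,  // filler (e.g. zero or last-measured) when `valid` is false
}

/// Nothing forces callers to come through here; reading `f.value`
/// directly silently yields the stale or filler value.
pub fn read_signal(f: &Frame) -> Option<u16> {
    if f.valid { Some(f.value) } else { None }
}
```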
The most common in-band alternative is to pick a single integer value and reserve it to mean “not-a-number” / “signal not available”–the usual choice being the maximum possible value for the field (i.e. `0xfff` for a 12-bit field). The value is near saturation anyway, so the odds that your system is designed to operate up there are low. It requires no extra data, and when read in isolation it’s “supposedly” easier to avoid making computational errors.
Supposedly gets scare quotes because if you just stuff the value into a raw integer, no programming language will give you a ton of protection–you might add/subtract/multiply/divide it without further thought. You could “promote” the values into floating point numbers on either side of the wire–floats have a native `NaN` representation–but that requires casting integer -> float, which increases size, plus there still exist systems without hardware floating point. Plus, you can use `NaN` floats by accident just as easily, and worse, they’ll silently taint all downstream computations.
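The silent taint is easy to demonstrate: once a `NaN` enters the arithmetic, nothing complains, every downstream result is `NaN`, and every comparison against it quietly answers false.

```rust
/// A typical ADC scaling: counts to volts on a 12-bit, 3.3 V reference.
pub fn to_volts(counts: f32) -> f32 {
    counts * 3.3 / 4095.0
}

// to_volts(f32::NAN) is still NaN: no error is raised, and any
// comparison against the result (>, <, ==) evaluates to false.
```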
## A Rusty Approach
What we really want is to represent things as a possibly-absent value: `Option<u16>`. We can compare against the invalid value when we deserialize the integer off the wire, and use it robustly internally.
```rust
pub fn deserialize(raw: u16) -> Option<u16> {
    if raw > 0xffe {
        return None;
    }
    Some(raw)
}
```
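For symmetry, the sending side needs the inverse. This is a hypothetical sketch (the original discussion only covers the receive path): `None` becomes the `0xfff` canary, and present values are clamped into the valid range.

```rust
/// Hypothetical encoder for the 12-bit wire format: `None` maps to the
/// 0xfff canary, present values are clamped into 0..=0xffe.
pub fn serialize(v: Option<u16>) -> u16 {
    match v {
        Some(x) => x.min(0xffe),
        None => 0xfff,
    }
}
```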
If we visit our good friend godbolt, we see this compiles into something like the following:
```asm
example::deserialize:
        mov     r1, r0
        uxth    r2, r0
        movs    r0, #0
        movw    r3, #4095
        cmp     r2, r3
        it      lo
        movlo   r0, #1
        bx      lr
```
Our input parameter is in `r0`, our output is in (`r0`, `r1`), represented as (`Some`/`None` flag, value). We immediately copy the value into `r1`–if we return `None`, it doesn’t matter what it was anyway–and then check if the value is in our allowed range, setting `r0 = 0` if not, or to `1` if it is. Straightforward, but note that our return value is now in two registers, and if we were storing it to RAM we’d have to allocate `std::mem::size_of::<Option<u16>>() = 4` bytes, double the normal size.
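That size claim is easy to check directly (a minimal sketch; the 4 bytes hold on any target where `u16` has 2-byte alignment):

```rust
use core::mem::size_of;

/// Option<u16> carries a separate discriminant, and u16's alignment
/// pads the pair out to twice the size of the bare integer.
pub fn option_overhead() -> (usize, usize) {
    (size_of::<u16>(), size_of::<Option<u16>>())
}
```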
We see this again when accessing the value, which we can do either by preserving the `Option`-ness or by assuming a default value if the input was not available. We’ll apply a scale and an offset, as one might do with a typical ADC calibration:
```rust
pub fn mapping_use(v: Option<u16>) -> Option<u16> {
    v.map(|x| 2 * x + 1)
}

pub fn default_use(v: Option<u16>) -> u16 {
    2 * v.unwrap_or(0) + 1
}
```
```asm
example::mapping_use:
        lsls    r1, r1, #1
        uxth    r0, r0
        adds    r1, #1
        cmp     r0, #0
        it      ne
        movne   r0, #1
        bx      lr

example::default_use:
        lsls    r1, r1, #1
        adds    r1, #1
        lsls    r0, r0, #16
        it      eq
        moveq   r1, #1
        mov     r0, r1
        bx      lr
```
Seven instructions in both cases; we see the arithmetic operations occurring against `r1` (the possibly-valid value), intermingled with checks of the `r0` register to decide how to populate the result. For some reason LLVM arbitrarily decides either to explicitly compare against zero or to shift the bits off the left side of the integer, but hey, it works either way. Basically we conduct the math operation unconditionally, and mark the result as `Some` after the fact if it was valid.
As an overall strategy this wasn’t awesome: we did successfully wrap the value so that it was impossible to accidentally use an invalid value in the arithmetic operation, but the solution is larger in memory than the raw integer, and we have this additional logic to populate the validity flag hanging around everywhere. Certainly not “zero cost” as abstractions go.
## Honey I Shrunk the Option
However! We can play the same null pointer optimization game that Rust does elsewhere by using a magical type, `NonZeroU16`. This type promises that a `0` value will never be stored in the field, which allows the compiler to do the magic optimizations for us. Unfortunately it isn’t `MaxValueU16`: we’ll have to shift our “valid” values from 0–4094 to 1–4095 in order to placate the optimizer gods. That sounds like more work, not less, but let’s try it out anyway.
```rust
use core::num::NonZeroU16;

pub fn deserialize(raw: u16) -> Option<NonZeroU16> {
    if raw > 0xffe {
        return None;
    }
    NonZeroU16::new(raw + 1)
}
```
```asm
example::deserialize:
        uxth    r1, r0
        adds    r0, #1
        movw    r2, #4094
        cmp     r1, r2
        it      hi
        movhi   r0, #0
        bx      lr
```
Other than LLVM inverting the conditional it generates, for whatever reason, this looks pretty similar to the `deserialize` we saw previously, but with one difference. While we had to do “more” work, we actually shaved an instruction: the code always has to “move” the value to its final location (either into `r1` as before, or as `r0 = r0 + 1` in the latter case), but “add 1” costs pretty much the same as a move instruction, so that comes for free. Because the newer version is able to optimize the representation such that it only uses `r0`, it doesn’t have to spend that extra instruction setting up `r1`. As a bonus, `std::mem::size_of::<Option<NonZeroU16>>() = 2` bytes–we got our space back!
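Both claims are directly checkable: the niche optimization makes `Option<NonZeroU16>` exactly the size of a bare `u16`, and the deserializer now stores `raw + 1` in the payload.

```rust
use core::mem::size_of;
use core::num::NonZeroU16;

/// The same shifted deserializer as above, repeated so these
/// checks are self-contained.
pub fn deserialize(raw: u16) -> Option<NonZeroU16> {
    if raw > 0xffe {
        return None;
    }
    NonZeroU16::new(raw + 1)
}

// The 0 niche encodes None, so no separate flag byte is needed:
// size_of::<Option<NonZeroU16>>() == size_of::<u16>() == 2.
```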
How about the cost-to-access?
```rust
pub fn mapping_use(v: Option<NonZeroU16>) -> Option<NonZeroU16> {
    v.and_then(|shifted| {
        let normal = shifted.get() - 1;
        let result = 2 * normal + 1;
        NonZeroU16::new(result + 1)
    })
}

pub fn default_use(v: Option<NonZeroU16>) -> u16 {
    2 * v.map_or(0, |shifted| shifted.get() - 1) + 1
}
```
```asm
example::mapping_use:
        lsls    r0, r0, #1
        bx      lr

example::default_use:
        mov.w   r1, #-1
        add.w   r1, r1, r0, lsl #1
        lsls    r0, r0, #16
        it      eq
        moveq   r1, #1
        mov     r0, r1
        bx      lr
```
Well then… one of those is certainly unexpected!

`default_use` looks pretty familiar–we have some extra work going on, but the instruction count comes out the same as before. Some of this has to do with ARM being pretty awesome and my choice of operation being a bit lucky: the `lsls r1, r1, #1` operation from before simply became `add.w r1, r1, r0, lsl #1`–subtracting 1 and accomplishing the shift at the same time. The rest of the basic structure is the same, though.
`mapping_use` surprised me! It turns out that I got even luckier with this one. Given the scaling I chose, we wound up with:

| r0 value | logical value | logical result | r0 return |
|---|---|---|---|
| 0 | None | None | 0 |
| 1 | 0 | 1 | 2 |
| 2 | 1 | 3 | 4 |
| 3 | 2 | 5 | 6 |

And so on. That turns into a single instruction!
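Put differently: in the shifted encoding, the whole map collapses to “double the raw representation” (with `None` encoded as 0, and doubling 0 gives 0)–exactly one left shift. That invariant can be checked in plain Rust:

```rust
use core::num::NonZeroU16;

/// mapping_use from above, repeated so the check is self-contained.
pub fn mapping_use(v: Option<NonZeroU16>) -> Option<NonZeroU16> {
    v.and_then(|shifted| {
        let normal = shifted.get() - 1;
        let result = 2 * normal + 1;
        NonZeroU16::new(result + 1)
    })
}

/// The raw register encoding of an Option<NonZeroU16>: None is the 0 niche.
pub fn encoding(v: Option<NonZeroU16>) -> u16 {
    v.map_or(0, NonZeroU16::get)
}
```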
## Less Cheating
Ok, while I didn’t choose those values in advance thinking they would simplify down to something so fancy, it’s some impressive work by LLVM. Let’s try something with less chance of bias–scaling into volts. I know I pooh-poohed floats above, but taking ADC counts -> float is still a reasonable transform to want to do. I’ll just throw all the samples at you at once:
```rust
pub fn to_volts(v: u16) -> f32 {
    (v as f32) * 3.3 / 4095.0
}

pub fn naked(v: u16) -> Option<f32> {
    if v > 0xffe {
        return None;
    }
    Some(to_volts(v))
}

pub fn raw_mapped(v: Option<u16>) -> Option<f32> {
    v.map(to_volts)
}

pub fn raw_defaulted(v: Option<u16>) -> f32 {
    to_volts(v.unwrap_or(0))
}

pub fn nz_mapped(v: Option<NonZeroU16>) -> Option<f32> {
    v.and_then(|shifted| Some(to_volts(shifted.get() - 1)))
}

pub fn nz_defaulted(v: Option<NonZeroU16>) -> f32 {
    v.map_or(0.0, |shifted| to_volts(shifted.get() - 1))
}
```
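Before looking at the codegen, one sanity check worth having: the straight-`u16` path and the shifted `NonZeroU16` path should agree for every possible 12-bit input. A self-contained sketch (repeating the functions above plus the shifted deserializer from earlier):

```rust
use core::num::NonZeroU16;

pub fn to_volts(v: u16) -> f32 {
    (v as f32) * 3.3 / 4095.0
}

pub fn naked(v: u16) -> Option<f32> {
    if v > 0xffe {
        return None;
    }
    Some(to_volts(v))
}

pub fn nz_mapped(v: Option<NonZeroU16>) -> Option<f32> {
    v.and_then(|shifted| Some(to_volts(shifted.get() - 1)))
}

/// The shifted deserializer from earlier, repeated for self-containment.
pub fn deserialize(raw: u16) -> Option<NonZeroU16> {
    if raw > 0xffe {
        return None;
    }
    NonZeroU16::new(raw + 1)
}

// For every 12-bit raw value, deserializing into the shifted encoding
// and mapping yields exactly the same Option<f32> as the naked check.
```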
This yields:
```asm
example::naked:
        uxth    r0, r0
        movw    r1, #4094
        cmp     r0, r1
        bls     .LBB5_2
        movs    r0, #0
        bx      lr
.LBB5_2:
        vmov    s0, r0
        vldr    s2, .LCPI5_0
        vcvt.f32.u32 s0, s0
        vldr    s4, .LCPI5_1
        vmul.f32 s0, s0, s2
        vdiv.f32 s0, s0, s4
        movs    r0, #1
        bx      lr
.LCPI5_0:
        .long   1079194419
.LCPI5_1:
        .long   1166012416

example::raw_mapped:
        lsls    r0, r0, #16
        beq     .LBB1_2
        uxth    r0, r1
        vmov    s0, r0
        vldr    s2, .LCPI1_0
        vcvt.f32.u32 s0, s0
        vldr    s4, .LCPI1_1
        vmul.f32 s0, s0, s2
        vdiv.f32 s0, s0, s4
        movs    r0, #1
        bx      lr
.LBB1_2:
        movs    r0, #0
        bx      lr
.LCPI1_0:
        .long   1079194419
.LCPI1_1:
        .long   1166012416

example::raw_defaulted:
        uxth    r0, r0
        cmp     r0, #0
        it      eq
        moveq   r1, r0
        uxth    r0, r1
        vmov    s0, r0
        vldr    s2, .LCPI2_0
        vcvt.f32.u32 s0, s0
        vldr    s4, .LCPI2_1
        vmul.f32 s0, s0, s2
        vdiv.f32 s0, s0, s4
        bx      lr
.LCPI2_0:
        .long   1079194419
.LCPI2_1:
        .long   1166012416

example::nz_mapped:
        lsls    r1, r0, #16
        beq     .LBB3_2
        subs    r0, #1
        uxth    r0, r0
        vmov    s0, r0
        vldr    s2, .LCPI3_0
        vcvt.f32.u32 s0, s0
        vldr    s4, .LCPI3_1
        vmul.f32 s0, s0, s2
        vdiv.f32 s0, s0, s4
        movs    r0, #1
        bx      lr
.LBB3_2:
        movs    r0, #0
        bx      lr
.LCPI3_0:
        .long   1079194419
.LCPI3_1:
        .long   1166012416

example::nz_defaulted:
        lsls    r1, r0, #16
        beq     .LBB4_2
        subs    r0, #1
        uxth    r0, r0
        vmov    s0, r0
        vldr    s2, .LCPI4_1
        vcvt.f32.u32 s0, s0
        vldr    s4, .LCPI4_2
        vmul.f32 s0, s0, s2
        vdiv.f32 s0, s0, s4
        bx      lr
.LBB4_2:
        vldr    s0, .LCPI4_0
        bx      lr
.LCPI4_0:
        .long   0
.LCPI4_1:
        .long   1079194419
.LCPI4_2:
        .long   1166012416
```
And a brief analysis:

| Version | Total insns | None path | Some path | Critical path |
|---|---|---|---|---|
| naked | 14 | 6 | 12 | 12 |
| raw mapped | 13 | 4 | 11 | 11 |
| raw defaulted | 12 | 12 | 12 | 12 |
| nz mapped | 14 | 4 | 12 | 12 |
| nz defaulted | 13 | 4 | 11 | 11 |
I’m generally going to call this a wash. There is an extra `subs` instruction in the `nz_mapped` path vs. the `raw_mapped` path, but `raw_defaulted` ends up spending one more instruction shuffling data around than `nz_defaulted`. At this point we start to leave the realm of the “obvious” and are well into micro-architectural territory; analyzing in any further depth is likely unwarranted.
## Conclusion
We can leverage Rust’s null pointer optimization in places where you wouldn’t immediately expect it to be possible. Doing so sometimes takes a bit of extra work, but that work doesn’t necessarily make things more expensive. As with all things optimization-related, measure your specific use case.