C++: A jump table with a template device

A few articles ago we saw how gcc might need some help when mixing template instanciation (pure compile time data) with function calls (deducible compile time information, but not available to the template expander). Now we’ll go one step further and combine all three types: pure compile time data, deducible compile time data and pure run time data (*). Just to annoy the compiler, and to see how gcc is able to optimize the results.

Let’s build a simple example, similar to what we used last time: an object that will determine the range of an integer and then invoke a callback with the closest range. Something like this could be used, for example, to allocate a buffer.

void boring(int x, func f) {
    if (x < 2) {
        f(2);
    } else if (x < 4) {
        f(4);
    } else if (x < 8) {
        f(8);
    } else if (x < 16) {
        // You get the idea...
    }
}

Can we build a prettier template version of this code, without any overhead? Let’s try:

typedef void (*func)(int);

template <int My_Size>
struct Foo {
    void bar(size_t size, func callback) {
        if (size > My_Size) {
            callback(My_Size);
        } else {
            next_foo.bar(size, callback);
        }
    }

    Foo<My_Size/2> next_foo;
};

// Stop condition
template<> struct Foo<0> {
    void bar(size_t, func) { }
};

void wrapper(int x, func f) {
    Foo<512> jump_table;
    jump_table.bar(x, f);
}

And now, let’s compile like as “g++ -fverbose-asm -S -O0 -c foo.cpp -o /dev/stdout | c++filt”. You’ll see something like this:

wrapper(int, void (*)(int)):
    call    Foo<512>::bar(unsigned long, void (*)(int))

Foo<512>::bar(unsigned long, void (*)(int)):
    cmpq    $512, %rsi    #, size
    jbe    .L4
    call    *%rdx    # callback
    jmp    .L3
.L4:
    call    Foo<256>::bar(unsigned long, void (*)(int))    #
.L3:
    leave

Foo<256>::bar(unsigned long, void (*)(int)):
    cmpq    $256, %rsi    #, size
    jbe    .L4
    call    *%rdx    # callback
    jmp    .L3
.L4:
    call    Foo<128>::bar(unsigned long, void (*)(int))    #
.L3:
    leave

# You get the idea, right?

Foo<0>::bar(unsigned long, void (*)(int)):
    # Stop condition, do nothing

That doesn’t look too good, does it? We don’t need to worry: we already learned that gcc needs help from the optimizer to handle template expansion and non static function calls. Let’s move to O1:

rapper(int, void (*)(int)):
.LFB14:
    cmpq    $512, %rdi    #, D.2974
    jbe    .L2    #,
    movl    $512, %edi    #,
    call    *%rsi    # f
    jmp    .L1    #
.L2:
    cmpq    $256, %rdi    #, D.2974
    jbe    .L4    #,
    movl    $256, %edi    #,
    call    *%rsi    # f
    jmp    .L1    #

# Again, it should be clear what's going on...

.L11:
    cmpq    $1, %rdi    #, D.2974
    .p2align 4,,2
    jbe    .L1    #,
    movl    $1, %edi    #,
    .p2align 4,,2
    call    *%rsi    # f
.L1:

It’s better than last time, but it doesn’t look great either: gcc managed to inline all calls, but it stopped there. Let’s move to O2 and see what happens:

wrapper(int, void (*)(int)):
    movslq    %edi, %rdi    # x, D.2987
    cmpq    $512, %rdi    #, D.2987
    ja    .L13    #,
    cmpq    $256, %rdi    #, D.2987
    ja    .L14    #,
    [ .... ]
    cmpq    $2, %rdi    #, D.2987
    ja    .L21    #,

.L13:
    movl    $512, %edi    #,
    jmp    *%rsi    # f

.L14:
    movl    $256, %edi    #,
    jmp    *%rsi    # f

[ .... ]

.L21:
    movl    $2, %edi    #,
    jmp    *%rsi    # f

.L1:
    rep
    ret
    .p2align 4,,10
    .p2align 3

Now, that looks much better. And we can now see that gcc generates the same code at -O2 for both versions of our code.

(*) Just for the sake of completion:

  • Pure compile time data is information directly available during compilation time, like a constant.
  • Deducible compile time data means something that can easily be deduced, like a function call to a non virtual method.
  • Run-time only data means something that a compiler could never deduce, like a volatile variable or the parameter of a function called from outside the current translation unit.
Advertisements

8 Comments on “C++: A jump table with a template device”

  1. ploxiln says:

    Maybe I’m being dumb, but I don’t think

        if (size > My_Size) {
            callback(My_Size);
    

    is the same logic as boring(). it would have to be something like

        if (size >= My_Size/2)  {
            callback(My_Size);
    

    (or refactored more significantly)

  2. Anonymous says:

    This is not a jump table — the resulting code still contains a chain of conditional jumps. A jump table uses an array of function pointers and a given value is used as an index into this array. In general, your approach will need N comparisons for N if cases whereas a true jump table will just do a one memory access and a jump to a register value regardless of a number of cases.

  3. Rob G says:

    I really don’t see how anyone sane could regard the template version as “prettier”.

    • It is very much a matter of personal opinion, but your point of view can change rapidly once you discover that someone typo’d a 32 for a 23 and you had to spend a day trying to figure out why some of the buckets are broken 🙂

  4. robdesbois says:

    For interest’s sake: is there a particular reason to prefer power-of-2 buffer sizes over exponentially growing from a non-power? I.e. if initial x is 3, is it really better to calculate the next power so you can allocate 4,8,16, 32… than just 3,6,12,24…?

    • Nothing more to add to ploxiln’s comment. I initially worked on this snippet because I had to do some bucketing similar to what jemalloc does (I’m not sure how jemalloc implements their buckets, though)

  5. ploxiln says:

    Re: why power-of-2:

    A natural checkpoint in the sizing is 4KiB (which is a power of 2), because that’s a memory page size (for most architectures) (because they can just mask the lower 12 bits to identify the page). The OS actually allocates memory to a process at the finest granularity of 4KiB. That’s also a good point to switch allocation strategies, maybe use mmap() to get a stand-alone chunk of memory, instead of something like sbrk() to extend the old-style heap region.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s