Fixing templates with constexpr’s

For my hundredth (and a bit) c++ post I decided to do something I never did before: fix my old code!

A long time ago I wrote about template metaprogramming devices. There, I tried to explain that many atrocities have been commited in the name of performance and “compile time evaluation”. Template metaprogramming is probably one of the worse culprits of job security. Its (ab)use can create monstrosities, all in the name of runtime performance. Like, for example, my template device to calculate e. Let’s remember what that atrocious code looks like (follow the link if you want an explanation on how this works):

template <int N, int D> struct Frak {
        static const long Num = N;
        static const long Den = D;
};
  
template <int N, typename F> struct ScalarMultiplication {
    typedef Frak<N*F::Num, N*F::Den> result;
};
  
template <int X, int Y>   struct MCD {
        static const long result = MCD<Y, X % Y>::result;
};
  
template <int X> struct MCD<X, 0> {
        static const long result = X;
};
  
template <class F> struct Simpl {
        static const long mcd = MCD<F::Num, F::Den>::result;
        typedef Frak< F::Num / mcd, F::Den / mcd > result;
};
  
template <typename X1, typename Y1> struct SameBase {
    typedef typename ScalarMultiplication< Y1::Den, X1>::result X;
    typedef typename ScalarMultiplication< X1::Den, Y1>::result Y;
};
  
template <typename X, typename Y> struct Sum {
    typedef SameBase<X, Y> B;
    static const long Num = B::X::Num + B::Y::Num;
    static const long Den = B::Y::Den; // == B::X::Den
    typedef typename Simpl< Frak<Num, Den> >::result result;
};

template <int N> struct Fact {
    static const long result = N * Fact<N-1>::result;
};
template <> struct Fact<0> {
    static const long result = 1;
};
 
template <int N> struct E {
    // e = S(1/n!) = 1/0! + 1/1! + 1/2! + ...
    static const long Den = Fact<N>::result;
    typedef Frak< 1, Den > term;
    typedef typename E<N-1>::result next_term;
    typedef typename Sum< term, next_term >::result result;
};
template <> struct E<0> {
    typedef Frak<1, 1> result;
};
 
int main() {
  typedef E<8>::result X;
  std::cout << "e = " << (1.0 * X::Num / X::Den) << "\n";
  std::cout << "e = " << X::Num <<"/"<< X::Den << "\n";
  return 0;
}

While this is just a toy example to play with templates, it does illustrate code I’ve seen in the wild. Would this look cleaner in c++11? Yes, it would. Constexprs are, in my opinion, one of the most overlooked “killer” features of c++11.

Starting with a simple example:

constexpr int foo(int a, int b) { return a+b; }
static constexpr int n = foo(1, 2);
int bar() { return n; }

Try to compile it with “g++ -std=c++11 -fverbose-asm -O0 -c -S -o /dev/stdout” and see what happens. You should get the equivalent of “return 3” – just as anyone would expect – but note that no optimizations were enabled. What about loops? Let’s try this:

constexpr int f(int n) {
    return (n<2)? 1 : n + f(n-1);
}

constexpr int n = f(999);

You’ll probably get an error about maximum depth exceeded, but that’s alright: we have loops in constexprs too! (note that some of these restrictions have been relaxed in C++17).

In general, if you can express your function as a single const return statement, it should be a valid constexpr. With this new knowledge, let’s convert the template meta-atrocity above to something a bit less hideous:

struct PodFrac {
    int num;
    int den;
};

constexpr int mcd(int a, int b) {
    return (b==0)? a : mcd(b, a%b);
}

constexpr PodFrac simpl(const PodFrac &f) {
    return PodFrac{f.num / mcd(f.num, f.den), f.den / mcd(f.num, f.den)};
}

constexpr PodFrac sum(const PodFrac &a, const PodFrac &b) {
    return simpl(PodFrac{a.num*b.den + b.num*a.den, a.den*b.den});
}

constexpr int fact(int n) {
    return (n==0)? 1 : n*fact(n-1);
}

constexpr PodFrac e(int n) {
    return (n==0)? PodFrac{1, 1} :
                   sum(PodFrac{1, fact(n)}, e(n-1));
}

constexpr float e_num = 1.0 * e(8).num / e(8).den;

float get_e() {
    return e_num;
}

Disclaimer: while I explicitly stated this multiple times in my “C++ template metaprogramming introduction” article, it’s worth re-stating it: this code is meant as an example to showcase a c++ feature, not as a proper way of deriving a mathematical constant in production code.

First thoughts after comparing the two versions: much, much [, much]*100 cleaner.

As you may notice, all constexprs need to be a return statement. There’s no multi-statement constexpr in c++11, which explains why loops are not really supported. For the same reason the implementation of e() is a bit hindered by this limitation: its code would be much more readable splitting it in a few lines with proper names. Good news: some of these restrictions have been relaxed in C++17.

Note that if you analyze your compiler’s output when building without optimizations, you may see either a const with e’s value, or a static initializer that does some trivial operation, like loading e’s value from a fraction: gcc seems to get tired of constexpr evaluation after a few recursive calls, so your results may vary (slightly).

I called constexpr’s one of c++11’s killer features, and hopefully you can see why I’m so enthusiastic about them now: there’s much less incentive for people to write horrible template metaprogramming devices when simply adding a little keyword to a normal function has the same effect, only cleaner.

Advertisements

C++: A jump table with a template device

A few articles ago we saw how gcc might need some help when mixing template instanciation (pure compile time data) with function calls (deducible compile time information, but not available to the template expander). Now we’ll go one step further and combine all three types: pure compile time data, deducible compile time data and pure run time data (*). Just to annoy the compiler, and to see how gcc is able to optimize the results.

Let’s build a simple example, similar to what we used last time: an object that will determine the range of an integer and then invoke a callback with the closest range. Something like this could be used, for example, to allocate a buffer.

void boring(int x, func f) {
    if (x < 2) {
        f(2);
    } else if (x < 4) {
        f(4);
    } else if (x < 8) {
        f(8);
    } else if (x < 16) {
        // You get the idea...
    }
}

Can we build a prettier template version of this code, without any overhead? Let’s try:

typedef void (*func)(int);

template <int My_Size>
struct Foo {
    void bar(size_t size, func callback) {
        if (size > My_Size) {
            callback(My_Size);
        } else {
            next_foo.bar(size, callback);
        }
    }

    Foo<My_Size/2> next_foo;
};

// Stop condition
template<> struct Foo<0> {
    void bar(size_t, func) { }
};

void wrapper(int x, func f) {
    Foo<512> jump_table;
    jump_table.bar(x, f);
}

And now, let’s compile like as “g++ -fverbose-asm -S -O0 -c foo.cpp -o /dev/stdout | c++filt”. You’ll see something like this:

wrapper(int, void (*)(int)):
    call    Foo<512>::bar(unsigned long, void (*)(int))

Foo<512>::bar(unsigned long, void (*)(int)):
    cmpq    $512, %rsi    #, size
    jbe    .L4
    call    *%rdx    # callback
    jmp    .L3
.L4:
    call    Foo<256>::bar(unsigned long, void (*)(int))    #
.L3:
    leave

Foo<256>::bar(unsigned long, void (*)(int)):
    cmpq    $256, %rsi    #, size
    jbe    .L4
    call    *%rdx    # callback
    jmp    .L3
.L4:
    call    Foo<128>::bar(unsigned long, void (*)(int))    #
.L3:
    leave

# You get the idea, right?

Foo<0>::bar(unsigned long, void (*)(int)):
    # Stop condition, do nothing

That doesn’t look too good, does it? We don’t need to worry: we already learned that gcc needs help from the optimizer to handle template expansion and non static function calls. Let’s move to O1:

rapper(int, void (*)(int)):
.LFB14:
    cmpq    $512, %rdi    #, D.2974
    jbe    .L2    #,
    movl    $512, %edi    #,
    call    *%rsi    # f
    jmp    .L1    #
.L2:
    cmpq    $256, %rdi    #, D.2974
    jbe    .L4    #,
    movl    $256, %edi    #,
    call    *%rsi    # f
    jmp    .L1    #

# Again, it should be clear what's going on...

.L11:
    cmpq    $1, %rdi    #, D.2974
    .p2align 4,,2
    jbe    .L1    #,
    movl    $1, %edi    #,
    .p2align 4,,2
    call    *%rsi    # f
.L1:

It’s better than last time, but it doesn’t look great either: gcc managed to inline all calls, but it stopped there. Let’s move to O2 and see what happens:

wrapper(int, void (*)(int)):
    movslq    %edi, %rdi    # x, D.2987
    cmpq    $512, %rdi    #, D.2987
    ja    .L13    #,
    cmpq    $256, %rdi    #, D.2987
    ja    .L14    #,
    [ .... ]
    cmpq    $2, %rdi    #, D.2987
    ja    .L21    #,

.L13:
    movl    $512, %edi    #,
    jmp    *%rsi    # f

.L14:
    movl    $256, %edi    #,
    jmp    *%rsi    # f

[ .... ]

.L21:
    movl    $2, %edi    #,
    jmp    *%rsi    # f

.L1:
    rep
    ret
    .p2align 4,,10
    .p2align 3

Now, that looks much better. And we can now see that gcc generates the same code at -O2 for both versions of our code.

(*) Just for the sake of completion:

  • Pure compile time data is information directly available during compilation time, like a constant.
  • Deducible compile time data means something that can easily be deduced, like a function call to a non virtual method.
  • Run-time only data means something that a compiler could never deduce, like a volatile variable or the parameter of a function called from outside the current translation unit.

gcc: Optimization levels and templates

Analyzing the assembly output for template devices can be a bit discouragging at times, specially when we spend hours trying to tune a mean looking template class only to find out the compiler is not able to reduce it’s value like we expected. But hold on, before throwing all your templates away you might want to figure out why they are not optimized.

Let’s start with a simple example: a template device to return the next power of 2:

template <int n, long curr_pow, bool stop>
struct Impl_Next_POW2 {
    static const bool is_smaller = n < curr_pow;
    static const long next_pow = _Next_POW2<n, curr_pow*2, is_smaller>::pow;
    static const long pow = is_smaller? curr_pow : next_pow;
};

template <int n, long curr_pow>
struct Impl_Next_POW2<n, curr_pow, true> {
    // This specializtion is important to stop the expansion
    static const long pow = curr_pow;
};

template <int n>
struct Next_POW2 {
    // Just a wrapper for _Next_POW2, to hide away some
    // implementation details
    static const long pow = _Next_POW2<n, 1, false>::pow;
};

Gcc can easily optimize that away, if you compile with “g++ foo.cpp -c -S -o /dev/stdout” you’ll just see the whole thing is replaced by a compile time constant. Let’s make gcc’s life a bit more complicated now:

template <int n, long curr_pow, bool stop>
struct Impl_Next_POW2 {
    static long get_pow() {
        static const bool is_smaller = n < curr_pow;
        return is_smaller?
                    curr_pow :
                    _Next_POW2<n, curr_pow*2, is_smaller>::get_pow();
    }
};

template <int n, long curr_pow>
struct Impl_Next_POW2<n, curr_pow, true> {
    static long get_pow() {
        return curr_pow;
    }
};

template <int n>
struct Next_POW2 {
    static long get_pow() {
        return _Next_POW2<n, 1, false>::get_pow();
    }
};

Same code but instead of using plain static values we wrap everything in a method. Compile with “g++ foo.cpp -c -S -fverbose-asm -o /dev/stdout | c++filt” and you’ll see something like this now:

main:
    call    Next_POW2<17>::get_pow()

Next_POW2<17>::get_pow():
    call    _Next_POW2<17, 1l, false>::get_pow()

_Next_POW2<17, 1l, false>::get_pow():
    call    _Next_POW2<17, 2l, false>::get_pow()

_Next_POW2<17, 2l, false>::get_pow():
    call    _Next_POW2<17, 4l, false>::get_pow()

_Next_POW2<17, 4l, false>::get_pow():
    call    _Next_POW2<17, 8l, false>::get_pow()

_Next_POW2<17, 8l, false>::get_pow():
    call    _Next_POW2<17, 16l, false>::get_pow()

_Next_POW2<17, 16l, false>::get_pow():
    call    _Next_POW2<17, 32l, false>::get_pow()

_Next_POW2<17, 32l, false>::get_pow():
    movl    $32, %eax    #, D.2171

What went wrong? It’s very clear for us the whole thing is just a chain of calls which could be replaced by the last one, however that information is now only available if you “inspect” the body of each function, and this is something the template instanciator (at least in gcc) can’t do. Luckily you just need to enable optimizations, -O1 is enough, to have gcc output the reduced version again.

Keep it in mind for the next time you’re optimizing your code with template metaprogramming: some times the template expander needs some help from the optimizer too.


A C++ template device to obtain an underlying type

What happens when you need to get the underlying data type of a pointer or reference? You can write some crazy metaprogram to do it for you. Like this:


template <typename T> struct get_real_type      { typedef T type; };
template <typename T> struct get_real_type<T*>  { typedef T type; };
template <typename T> struct get_real_type<T&>  { typedef T type; };

template <class T>
int foo() {
    return get_real_type<T>::type::N;
}

struct Bar {
    static const int N=24;
};

#include <iostream>
using namespace std;
int main() {
    cout << foo<Bar*>() << endl;
    cout << foo<Bar&>() << endl;
    cout << foo<Bar>() << endl;
}

Incidentally, this is also the basis for the implementation of std::remove_reference. Actually you’d be better of using std::remove_reference, for your own sanity.


Useless code: a template device to calculate e

Recently I needed to flex a bit my template metaprogrammingfooness, so I decided to go back and review and old article I wrote about it (C++11 made some parts of those articles obsolete, but I’m surprised of how well it’s aged). To practice a bit I decided to tackle a problem I’m sure no one ever had before: defining a mathematical const on compile time. This is what I ended up with, do you have a better version? Shouldn’t be to hard.

template <int N, int D> struct Frak {
static const long Num = N;
static const long Den = D;
};

template <class X, int N> struct MultEscalar {
typedef Frak< N*X::Num, N*X::Den > result;
};

template <class X1, class Y1> struct IgualBase {
typedef typename MultEscalar< X1, Y1::Den >::result X;
typedef typename MultEscalar< Y1, X1::Den >::result Y;
};

template <int X, int Y> struct MCD {
static const long result = MCD<Y, X % Y>::result;
};
template <int X> struct MCD<X, 0> {
static const long result = X;
};

template <class F> struct Simpl {
static const long mcd = MCD<F::Num, F::Den>::result;
typedef Frak< F::Num / mcd, F::Den / mcd > result;
};

template <class X, class Y> struct Suma {
typedef IgualBase<X, Y> B;
static const long Num = B::X::Num + B::Y::Num;
static const long Den = B::Y::Den; // == B::X::Den
typedef typename Simpl< Frak<Num, Den> >::result result;
};

template <int N> struct Fact {
static const long result = N * Fact<N-1>::result;
};
template <> struct Fact<0> {
static const long result = 1;
};

template <int N> struct E {
// e = S(1/n!) = 1/0! + 1/1! + 1/2! + …
static const long Den = Fact<N>::result;
typedef Frak< 1, Den > term;
typedef typename E<N-1>::result next_term;
typedef typename Suma< term, next_term >::result result;
};
template <> struct E<0> {
typedef Frak<1, 1> result;
};

#include <iostream>
int main() {
typedef E<8>::result X;
std::cout << "e = " << (1.0 * X::Num / X::Den) << "\n";
std::cout << "e = " << X::Num <<"/"<< X::Den << "\n";
return 0;
}


Cool C++0X features XIII: auto and ranged for, cleaner loops FTW

Long time without updating this series. Last time we saw how the ugly

for (FooContainer::const_iterator i = foobar.begin(); i != foobar.end(); ++i)

could be transformed into the much cleaner

for (auto i = foobar.begin(); i != foobar.end(); ++i)

Yet we are not done, we can clean that a lot more using for range statements.

Ranged for is basically syntactic sugar (no flamewar intended) for shorter for statements. It’s nothing new and it’s been part of many languages for many years already, so there will be no lores about the greatness of C++ innovations (flamewar intended), but it still is a nice improvement to have, considering how tedious can be to write nested loops. This certainly looks much cleaner:

for (auto x : foobar)

This last for-statement, even though it looks good enough to print and hang in a wall, raises a lot of questions. What’s the type of x? What if I want to change its value? Let’s try to answer that.

The type of the iterator will be the same as the type of the vector, so in this case x would be an int:

std::vector foobar;
for (auto x : foobar) {
	std::cout &lt;&lt; (x+2);
}

And now, what happens if you want to alter the contents of a list and not only display them? That’s easy too, just declare x as an auto reference:

std::vector foobar;
for (auto&amp; x : foobar) {
	std::cout &lt;&lt; (x+2);
}

This looks really nice but it won’t really do anything, for two different reasons:
* Ranged fors won’t work until g++ 4.5.6 is released
* The list is empty!

There are many ways to initialize that list, but we’ll see how C++0X let’s you do it in a new way the next time.


stlfilt: read ugly tmpl errors

There’s nothing better for Monday mornings than the smell of hundreds of template errors after a make clean all. When using template metaprogramming, a tiny misplaced coma can generate enough error code that, if printed, would crush you under tones of paper. And don’t even try to read them, it’ll make your head explode.

Luckily STLFilt can be quite a relief when dealing with this kind of errors. Granted, it won’t make a steaming pile of poo seem to be a nice poem, but if you have something like the dog in the picture, to use a metaphor, at least it would put a blanket on its face.


Cool C++0X features XII: type inference with auto

In the last four entries we worked on a simple example, like the one I’m pasting below, of type inference with decltype, which led us to learn about delayed type declaration and decltypes with auto. This time I want to focus just on the auto keyword instead.

template <class... Args>
auto wrap(Args... a) -> decltype( do_something(a...) ) {
	std::cout << __PRETTY_FUNCTION__ << "n";
	return do_something(a...);
}

We saw last time how decltype can be used in a contrived way to create a local variable without specifying its type, only how to deduce the type for this variable. Luckily, that verbose method of type declaration can be summed up in the following way:

	int x = 2;
	int y = 3;
	decltype(x*y) z = x*y;
	int x = 2;
	int y = 3;
	auto z = x*y;

That’s right, when you are declaring local variables it’s easier and cleaner to just use auto. This feature isn’t even “in the wild” yet, so you can’t really predict what will people do with it, but it seems to me that limiting its use to local variables with a very short lived scope is the best strategy. We are yet to see what monstrosities the abuse of this feature will produce, and I’m sure there will be many. Regardless of their potential to drive insane any maintainers, its best use probably comes in loops.

In any C++ application, you’ll find code like this:

for (FooContainer<Bar>::const_iterator i = foobar.begin(); i != foobar.end(); ++i)

This ugly code can be eliminated with something much more elegant:

for (auto i = foobar.begin(); i != foobar.end(); ++i)

Looks nicer indeed, but we can improve it much further with other tools. We’ll see how the next time. For the time being, let’s see for what auto is not to be used.

When using auto, keep in mind it was designed to simplify the declaration of a variable with a complex or difficult to reason type, not as a replacement for other language features like templates. This is a common mistake:

Wrong

void f(auto x) {
	cout << x;
}
Right

template <T>
void f(T x) {
	cout << x;
}

It makes no sense to use auto in the place of a template, since a template means that the type will be completed later whereas auto means it should be deduced from an initializer.


Cool C++0X features X: type inference with decltype

After creating a wrapper object on the last entries, we were left with three syntax changes two analyze:

We already saw the first, and we’ll be talking about the other two this time. This was the original wrapper function which led us here:

template <class... Args>
auto wrap(Args... a) -> decltype( do_something(a...) ) { 
	std::cout << __PRETTY_FUNCTION__ << "n";
	return do_something(a...);
}

Back on topic:

decltype
This operator (yes, decltype is an operator) is a cousin of sizeof which will yield the type of an expression. Why do I say it’s a cousin of sizeof? Because it’s been in the compilers for a long time, only in disguise. This is because you can’t get the size of an expression without knowing it’s type, so even though it’s implementation has existed for a long time only now it’s available to the programmer.

One of it’s interesting features is that the expression with which you call decltype won’t be evaluated, so you can safely use a function call within a decltype, like this:

auto foo(int x) -> decltype( bar(x) ) { 
	return bar(x);
}

Doing this with, say, a macro, would get bar(x) evaluated twice, yet with decltype it will be evaluated only once. Any valid C++ expression can go within a decltype operator, so for example this is valid too:

template <typename A, typename B>
auto multiply(A x, B y) -> decltype( x*y )
{ 
	return x*y;
}

What’s the type of A and B? What’s the type of A*B? We don’t care, the compiler will take care of that for us. Let’s look again at that example, more closely:

-> (delayed declaration) and decltype

Why bother creating a delayed type declaration at all and not just use the decltype in place of the auto? That’s because of a scope problem, see this:

// Declare a template function receiving two types as param
template <typename A, typename B>
// If we are declaring a multiplication operation, what's the return type of A*B?
// We can't multiply classes, and we don't know any instances of them
auto multiply(A x, B y)
// Luckily, the method signature now defined both parameters, meaning
// we don't need to expressly know the type of A*B, we just evaluate
// x*y and use whatever type that yields
	-> decltype( x*y )
{ 
	return x*y;
}

decltype
As you see, decltype can be a very powerful tool if the return type of a function is not known for the programmer when writing the code, but you can use it to declare any type, anywhere, if you are too lazy to type. If you, for example, are very bad at math and don’t remember that the integers group is closed for multiplication, you could write this:

	int x = 2;
	int y = 3;
	decltype(x*y) z = x*y;

Yes, you can use it as VB’s dim! (kidding, just kidding, please don’t hit me). Even though this works and it’s perfectly legal, auto is a better option for this. We’ll see that on the next entry.


Cool C++0X features IX: delayed type declaration

In the last two entries we worked on a wrapper object which allows us to decorate a method before or after calling (hello aspects!), or at least that’s what it should do when g++ fully implements decltypes and variadic templates. Our wrapper function looks something like this (check out the previous entry for the wrapper object):

#include <iostream>

void do_something() { std::cout << __PRETTY_FUNCTION__ << "n"; }
void do_something(const char*) { std::cout << __PRETTY_FUNCTION__ << "n"; }
int do_something(int) { std::cout << __PRETTY_FUNCTION__ << "n"; return 123; }

template <class... Args>
auto wrap(Args... a) -> decltype( do_something(a...) ) { 
	std::cout << __PRETTY_FUNCTION__ << "n";
	return do_something(a...);
}

int main() {
	wrap();
	wrap("nice");
	int x = wrap(42);
	std::cout << x << "n";
	return 0;
}

After the example, we were left with three new syntax changes to analyze:

  • -> (delayed declaration)
  • decltype
  • auto

Let’s study the -> operator this time:

-> (delayed declaration)
This is the easiest one. When a method is declared auto (I’ve left this one for the end because auto is used for other things too) it means its return type will be defined somewhere else. Note that in this regard the final implementation differs from Stroustroup’s FAQ.

The -> operator in a method’s definition says “Here’s the return type”. I’ll paste the same simple example we had last time, the following two snippets of code are equivalent:

void foo() { 
}
auto foo() -> void { 
}