#### HOME

### Computing the Hamming weight

#### Exploring performance and interoperability between Erlang, Golang and C

I was investigating how easy/difficult it is to call raw C functions from Erlang or Go so I wrote some notes here. I used the Hamming weight for my benchmark as it is self-contained and allowed me to make use of Erlang superb binary handling.The Hamming weight (also known as population count or popcount) of a binary (a sequence of zeroes and ones) is simply its total number of up-bits (ones). There exists different methods to compute the Hamming weight, all having different levels of sophistication and efficiency. This post contrasts the performance of different methods in Erlang, Go and C, as well as the performance when calling the C function from the two higher level languages.

The most naive method as one would expect is to look at the bits one at the time and sum them. In Erlang's functional, pattern-matching-heavy fashion, you recurse over the tail of the binary while keeping count with an accumulator variable:

1 2 3 4 5 | ```
weight(Bitstring) -> weight(Bitstring, 0).
weight(<<>>, Acc) -> Acc;
weight(<<0:1, Tail/bits>>, Acc) -> weight(Tail, Acc);
weight(<<1:1, Tail/bits>>, Acc) -> weight(Tail, Acc+1).
``` |

The other method uses clever bitwise manipulation to leverage the speed of assembler instructions. The gist of this method is a divide-and-conquer approach where pre-computed binary masks are applied to the binary of interest and the count of up-bits in each sub-binary is summed in a tree pattern. Let's do a simplistic example for 8-bits integer in pseudocode (I use // for comments):

input: x, an 8-bit integer // for this example, let x be 10110011 let a = x & 01010101 // a is 00010001 let b = (x>>1) & 01010101 // b is 01011001 & 01010101 -> 01010001 let c = a + b // c is 00010001 + 01010001 -> 01100010 let d = c & 00110011 // d is 01100010 & 00110011 -> 00100010 let e = (c>>2) & 00110011 // e is 00011000 & 00110011 -> 00010000 let f = d + e // f is 00100010 + 00010000 -> 00110010 let g = f & 00001111 // g is 00110010 & 00001111 -> 00000010 let h = (f>>4) & 00001111 // h is 00000011 & 00001111 -> 00000011 let i = g + h // i is 00000010 + 00000011 -> 00000101In base 10, i is 5 and the answer we are looking for. The C code below do the same trick as above, except with 64-bit binaries:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | ```
unsigned int weight(unsigned long long input) {
unsigned long long mask1, mask2, mask3, mask4, mask5, mask6;
mask1 = 6148914691236517205ULL; // 01010101...
mask2 = 3689348814741910323ULL; // 00110011...
mask3 = 1085102592571150095ULL; // 00001111...
mask4 = 71777214294589695ULL; // 8 zeroes, 8 ones, etc...
mask5 = 70367670468607ULL; // 16 zeroes, 16 ones, etc...
mask6 = 4294967295ULL; // 32 zeroes, 32 ones
input = (input & mask1) + ((input>>1) & mask1);
input = (input & mask2) + ((input>>2) & mask2);
input = (input & mask3) + ((input>>4) & mask3);
input = (input & mask4) + ((input>>8) & mask4);
input = (input & mask5) + ((input>>16) & mask5);
input = (input & mask6) + ((input>>32) & mask6);
return (unsigned int) input;
}
``` |

Golang, with its very C-like syntax, is almost a copy of the above:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | ```
func popcount(input uint64) uint8 {
var mask1, mask2, mask3, mask4, mask5, mask6 uint64
mask1 = 6148914691236517205 // 01010101...
mask2 = 3689348814741910323 // 00110011...
mask3 = 1085102592571150095 // 00001111...
mask4 = 71777214294589695 // 8 zeroes, 8 ones, etc...
mask5 = 70367670468607 // 16 zeroes, 16 ones, etc...
mask6 = 4294967295 // 32 zeroes, 32 ones
input = (input & mask1) + ((input >> 1) & mask1)
input = (input & mask2) + ((input >> 2) & mask2)
input = (input & mask3) + ((input >> 4) & mask3)
input = (input & mask4) + ((input >> 8) & mask4)
input = (input & mask5) + ((input >> 16) & mask5)
input = (input & mask6) + ((input >> 32) & mask6)
return uint8(input)
}
``` |

^{6}-words-long binary (i.e. a 128 000 000 bits sequence on a 64-bit architecture). The Erlang solution touched one bit at a time, while the C and Go solutions chopped 8 bytes at a time. Here's the results:

Test size: 2000000 words (64 bits per words) native Erlang took....... 453 ms raw C took............... 11 ms raw Go took.............. 17 ms C called from Erlang took 455 ms C called from Golang took 120 msConclusion: There appears to be quite a bit of overhead calling C functions from both Erlang and Golang, most likely due to the work of mapping/casting types back and forth from one languague to another. Interestingly, raw Golang's performance was pretty close to maximally optimized C (with gcc's -O3 flag). All the code can be viewed/downloaded on Github.

- Alex, Apr. 16

^{th}2013 (edited Jan. 25

^{th}2014)