### Computing the Hamming weight

#### Exploring performance and interoperability between Erlang, Golang and C

I was investigating how easy/difficult it is to call raw C functions from Erlang or Go so I wrote some notes here. I used the Hamming weight for my benchmark as it is self-contained and allowed me to make use of Erlang superb binary handling.

The Hamming weight (also known as population count or popcount) of a binary (a sequence of zeroes and ones) is simply its total number of up-bits (ones). There exists different methods to compute the Hamming weight, all having different levels of sophistication and efficiency. This post contrasts the performance of different methods in Erlang, Go and C, as well as the performance when calling the C function from the two higher level languages.

The most naive method as one would expect is to look at the bits one at the time and sum them. In Erlang's functional, pattern-matching-heavy fashion, you recurse over the tail of the binary while keeping count with an accumulator variable:

 ```1 2 3 4 5``` ```weight(Bitstring) -> weight(Bitstring, 0). weight(<<>>, Acc) -> Acc; weight(<<0:1, Tail/bits>>, Acc) -> weight(Tail, Acc); weight(<<1:1, Tail/bits>>, Acc) -> weight(Tail, Acc+1). ```

The other method uses clever bitwise manipulation to leverage the speed of assembler instructions. The gist of this method is a divide-and-conquer approach where pre-computed binary masks are applied to the binary of interest and the count of up-bits in each sub-binary is summed in a tree pattern. Let's do a simplistic example for 8-bits integer in pseudocode (I use // for comments):
```input: x, an 8-bit integer // for this example, let x be 10110011

let a = x & 01010101      // a is 00010001
let b = (x>>1) & 01010101 // b is 01011001 & 01010101 -> 01010001
let c = a + b             // c is 00010001 + 01010001 -> 01100010

let d = c & 00110011      // d is 01100010 & 00110011 -> 00100010
let e = (c>>2) & 00110011 // e is 00011000 & 00110011 -> 00010000
let f = d + e             // f is 00100010 + 00010000 -> 00110010

let g = f & 00001111      // g is 00110010 & 00001111 -> 00000010
let h = (f>>4) & 00001111 // h is 00000011 & 00001111 -> 00000011
let i = g + h             // i is 00000010 + 00000011 -> 00000101
```
In base 10, i is 5 and the answer we are looking for. The C code below do the same trick as above, except with 64-bit binaries:

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19``` ```unsigned int weight(unsigned long long input) { unsigned long long mask1, mask2, mask3, mask4, mask5, mask6; mask1 = 6148914691236517205ULL; // 01010101... mask2 = 3689348814741910323ULL; // 00110011... mask3 = 1085102592571150095ULL; // 00001111... mask4 = 71777214294589695ULL; // 8 zeroes, 8 ones, etc... mask5 = 70367670468607ULL; // 16 zeroes, 16 ones, etc... mask6 = 4294967295ULL; // 32 zeroes, 32 ones input = (input & mask1) + ((input>>1) & mask1); input = (input & mask2) + ((input>>2) & mask2); input = (input & mask3) + ((input>>4) & mask3); input = (input & mask4) + ((input>>8) & mask4); input = (input & mask5) + ((input>>16) & mask5); input = (input & mask6) + ((input>>32) & mask6); return (unsigned int) input; } ```

Golang, with its very C-like syntax, is almost a copy of the above:

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19``` ```func popcount(input uint64) uint8 { var mask1, mask2, mask3, mask4, mask5, mask6 uint64 mask1 = 6148914691236517205 // 01010101... mask2 = 3689348814741910323 // 00110011... mask3 = 1085102592571150095 // 00001111... mask4 = 71777214294589695 // 8 zeroes, 8 ones, etc... mask5 = 70367670468607 // 16 zeroes, 16 ones, etc... mask6 = 4294967295 // 32 zeroes, 32 ones input = (input & mask1) + ((input >> 1) & mask1) input = (input & mask2) + ((input >> 2) & mask2) input = (input & mask3) + ((input >> 4) & mask3) input = (input & mask4) + ((input >> 8) & mask4) input = (input & mask5) + ((input >> 16) & mask5) input = (input & mask6) + ((input >> 32) & mask6) return uint8(input) } ```
I timed five different solutions: (i) the native recursive Erlang solution, (ii) the raw C solution, (iii) the raw Golang solution, (iv) the C function called from the Erlang VM through native implemented functions (NIF) and (v) the C function called from Golang using the cgo module. As a quick-and-dirty benchmark, I created a 2x106-words-long binary (i.e. a 128 000 000 bits sequence on a 64-bit architecture). The Erlang solution touched one bit at a time, while the C and Go solutions chopped 8 bytes at a time. Here's the results:
```Test size: 2000000 words (64 bits per words)
native Erlang took....... 453 ms
raw C took...............  11 ms
raw Go took..............  17 ms
C called from Erlang took 455 ms
C called from Golang took 120 ms
```
Conclusion: There appears to be quite a bit of overhead calling C functions from both Erlang and Golang, most likely due to the work of mapping/casting types back and forth from one languague to another. Interestingly, raw Golang's performance was pretty close to maximally optimized C (with gcc's -O3 flag). All the code can be viewed/downloaded on Github.

- Alex, Apr. 16th 2013 (edited Jan. 25th 2014)