mem_is_zero is fast on x64, where 128-bit registers are available,
but it is also very easy to optimize for 32-bit Intel and ARM CPUs.
The speed won't be anywhere near the fastest variant, but it is still
almost 7x faster than a regular byte-by-byte check.
Now with an extra test to check for corner cases that may pop up with
such implementations.
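
For illustration only, a minimal sketch of the word-at-a-time idea
described above. It is not the code from this change: the name
mem_is_zero_sketch and the use of uintptr_t as the word type are
assumptions made for the example. The byte-wise tail loop is the kind
of corner case (lengths that are not a multiple of the word size) the
extra test is meant to cover.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch, not the actual implementation from this commit.
// Returns true if all len bytes starting at data are zero.
static bool mem_is_zero_sketch(const char *data, std::size_t len)
{
  const char *end = data + len;

  // Check one machine word per iteration (4 bytes on 32-bit CPUs).
  // memcpy keeps the load well-defined for unaligned pointers and is
  // normally compiled down to a plain load.
  while (static_cast<std::size_t>(end - data) >= sizeof(std::uintptr_t)) {
    std::uintptr_t word;
    std::memcpy(&word, data, sizeof(word));
    if (word != 0)
      return false;
    data += sizeof(word);
  }

  // Handle the remaining tail (fewer bytes than one word) byte by byte.
  while (data < end) {
    if (*data != 0)
      return false;
    ++data;
  }
  return true;
}
```
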
Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>