erasure-code/isa/xor_op: add neon-based region_xor implementation
A NEON load instruction loads 128 bits (16 bytes), and CPUs generally
have two load pipelines, so region_xor can process 32 bytes per
iteration.
In tests with ceph_erasure_code_benchmark, run time drops by roughly
10% to 50% in the cases below, with the largest gains at the larger
sizes.
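For illustration only, a minimal sketch of a 32-byte-per-iteration XOR
loop using NEON intrinsics. It is not the code in this commit: the name
region_xor_sketch, the two-pointer signature, and the scalar fallback
for non-NEON builds are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__aarch64__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#endif

/* Hypothetical helper: XOR len bytes of src into dst (dst ^= src). */
static void region_xor_sketch(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;

#ifdef SKETCH_HAVE_NEON
    /* Two independent 128-bit loads per operand cover 32 bytes per
     * iteration, so both halves can be loaded without depending on
     * each other. */
    for (; i + 32 <= len; i += 32) {
        uint8x16_t d0 = vld1q_u8(dst + i);
        uint8x16_t d1 = vld1q_u8(dst + i + 16);
        uint8x16_t s0 = vld1q_u8(src + i);
        uint8x16_t s1 = vld1q_u8(src + i + 16);

        vst1q_u8(dst + i,      veorq_u8(d0, s0));
        vst1q_u8(dst + i + 16, veorq_u8(d1, s1));
    }
#endif

    /* Scalar loop: handles the tail shorter than 32 bytes, and the
     * whole region when NEON is not available. */
    for (; i < len; i++)
        dst[i] ^= src[i];
}
```

The scalar loop keeps the sketch correct for lengths that are not a
multiple of 32; whether the real implementation handles the tail this
way is not stated here.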
loop = 10000
(k, m, size)    | base (s) | neon (s)
------------------------------------------
(4, 1, 16384)   | 0.018    | 0.015
(4, 1, 65536)   | 0.043    | 0.037
(4, 1, 102400)  | 0.058    | 0.049
(8, 1, 32768)   | 0.034    | 0.029
(8, 1, 65536)   | 0.052    | 0.045
(8, 1, 102400)  | 0.068    | 0.061
(8, 1, 524288)  | 0.631    | 0.420
(8, 1, 1048576) | 1.561    | 0.931
(8, 1, 8388608) | 16.70    | 8.244
------------------------------------------
Signed-off-by: chenxuqiang <chenxuqiang3@hisilicon.com>