Using SSE2 intrinsic calls may speed up your program execution substantially. However, nvcc seems to be unable to compile SSE2 code: its front end does not recognize the GCC builtins that the intrinsic headers rely on. For example, including emmintrin.h or an equivalent header will give errors like this:
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(48): error: identifier "__builtin_ia32_emms" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(61): error: identifier "__builtin_ia32_vec_init_v2si" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(90): error: identifier "__builtin_ia32_vec_ext_v2si" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(114): error: identifier "__builtin_ia32_packsswb" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(129): error: identifier "__builtin_ia32_packssdw" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(144): error: identifier "__builtin_ia32_packuswb" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(158): error: identifier "__builtin_ia32_punpckhbw" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(172): error: identifier "__builtin_ia32_punpckhwd" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(186): error: identifier "__builtin_ia32_punpckhdq" is undefined
.
.
.
Error limit reached.
100 errors detected in the compilation of "/tmp/tmpxft_000010b9_00000000-4_template.cpp1.ii".
Compilation terminated.
make: *** [obj/release/template.cu_o] error 255
To work around this problem, compile the SSE2 code with gcc and the CUDA code with nvcc, and then link the resulting object files together.
So create a separate .c and .h file containing a function that executes the SSE2 intrinsic calls. Include emmintrin.h in the .c file only: including it in the .h file will produce the same errors as above, because nvcc reads the .h file.
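As a minimal sketch of what such a .c file could contain, assuming a hypothetical function add_ints_sse2 (the name, signature, and choice of intrinsics are illustrative and not part of the original setup):

```c
/* cpuCode.c: compiled with gcc only, never read by nvcc.
 * add_ints_sse2 is a hypothetical example function; the matching .h file
 * would contain nothing but its prototype:
 *     void add_ints_sse2(const int *a, const int *b, int *out, int n);
 */
#include <emmintrin.h>  /* SSE2 intrinsics; keep this include out of the .h */

/* Element-wise out[i] = a[i] + b[i] for n 32-bit integers. */
void add_ints_sse2(const int *a, const int *b, int *out, int n)
{
    int i = 0;

    /* Main loop: four 32-bit integers per 128-bit SSE2 register. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The unaligned load and store intrinsics are used so the sketch makes no assumption about how the caller allocated the arrays.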
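To use the SSE2 function from your .cu file, include the new .h file. Because nvcc compiles .cu files as C++ while gcc compiled the function as C, the include must be wrapped in an extern "C" block so that the linker can match the unmangled function name, like this: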
extern "C" {
#include "yourfile.h"
}
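A common alternative, sketched here under the assumption that you control the header, is to put the linkage guard inside the .h file itself so that it can be included unchanged from both C and C++ code. The function name add_ints_sse2 and the guard macro YOURFILE_H are hypothetical:

```c
/* yourfile.h: safe to include from both .c and .cu files.
 * It must not include emmintrin.h, since nvcc reads this file. */
#ifndef YOURFILE_H
#define YOURFILE_H

#ifdef __cplusplus
extern "C" {   /* .cu files are compiled as C++: request C linkage */
#endif

/* Hypothetical SSE2-backed function, implemented in the gcc-compiled .c file. */
void add_ints_sse2(const int *a, const int *b, int *out, int n);

#ifdef __cplusplus
}
#endif

#endif /* YOURFILE_H */
```

With a header written this way, the .cu file could use a plain #include "yourfile.h" without the surrounding extern "C" block.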
Finally, you need to compile the files separately using nvcc and gcc and then link them together:
gcc -c cpuCode.c -o cpuCode.o
nvcc -c cudaCode.cu -o cudaCode.o
gcc cudaCode.o cpuCode.o -o progExe
Note: This is just an illustration. These three lines won't work by themselves; you still need to add the required libraries etc.