Friday, May 1, 2009

CUDA and SSE2 intrinsics

Using SSE2 intrinsic calls may speed up your program substantially. However, nvcc seems to be unable to compile code that uses SSE2 intrinsics. For example, including emmintrin.h or an equivalent header gives errors like this:

/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(48): error: identifier "__builtin_ia32_emms" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(61): error: identifier "__builtin_ia32_vec_init_v2si" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(90): error: identifier "__builtin_ia32_vec_ext_v2si" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(114): error: identifier "__builtin_ia32_packsswb" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(129): error: identifier "__builtin_ia32_packssdw" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(144): error: identifier "__builtin_ia32_packuswb" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(158): error: identifier "__builtin_ia32_punpckhbw" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(172): error: identifier "__builtin_ia32_punpckhwd" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(186): error: identifier "__builtin_ia32_punpckhdq" is undefined
.
.
.
Error limit reached.
100 errors detected in the compilation of "/tmp/tmpxft_000010b9_00000000-4_template.cpp1.ii".
Compilation terminated.
make: *** [obj/release/template.cu_o] error 255


To work around this problem, compile the code that uses SSE2 with gcc and your CUDA code with nvcc, and then link the resulting object files together.

So create a separate .c and .h file containing a function that executes the SSE2 intrinsic calls. Include emmintrin.h in the .c file only. Including it in the .h file gives the same errors as above, because nvcc reads the .h file.
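As a minimal sketch of that split, the block below shows what the two files might contain. The file names cpuCode.h and cpuCode.c and the function sse2_add_ints are hypothetical names chosen for illustration. The header holds only a plain C declaration, so nvcc never sees emmintrin.h.

```c
/* ---- cpuCode.h (hypothetical name): plain C declarations only ---- */
#ifndef CPUCODE_H
#define CPUCODE_H

/* Adds two int arrays element-wise: out[i] = a[i] + b[i] for i in [0, n). */
void sse2_add_ints(const int *a, const int *b, int *out, int n);

#endif /* CPUCODE_H */

/* ---- cpuCode.c (hypothetical name): the only file that includes emmintrin.h ---- */
#include <emmintrin.h>

void sse2_add_ints(const int *a, const int *b, int *out, int n)
{
    int i = 0;

    /* Process four 32-bit integers per iteration using unaligned SSE2 loads/stores. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

In a real project the two parts live in separate files, and cpuCode.c would start with #include "cpuCode.h" so the compiler can check the definition against the declaration.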

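To use the SSE2 intrinsic function from your .cu file, you need to include the new .h file, but inside an extern "C" block. nvcc compiles .cu files as C++, so without the block the linker would look for a C++-mangled name instead of the C function that gcc produced: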

extern "C" {
#include "yourfile.h"
}
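A common alternative, shown here only as a sketch, is to put the linkage guard in the header itself, so that every file including it gets C linkage automatically. The function name sse2_add_ints is hypothetical. The __cplusplus macro is defined only when the file is compiled as C++, so gcc in C mode skips the extern "C" lines.

```c
/* yourfile.h: declarations with a built-in C linkage guard */
#ifndef YOURFILE_H
#define YOURFILE_H

#ifdef __cplusplus
extern "C" {   /* seen only by C++ compilers, such as nvcc on a .cu file */
#endif

/* Hypothetical SSE2-backed function, implemented in the separate .c file. */
void sse2_add_ints(const int *a, const int *b, int *out, int n);

#ifdef __cplusplus
}
#endif

#endif /* YOURFILE_H */
```

With a header written this way, a plain #include "yourfile.h" in the .cu file is enough. Keeping the wrapper at the include site as well does no harm, since nested extern "C" blocks are allowed.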


Finally, you need to compile the files separately using gcc and nvcc and then link the object files together:

gcc -c cpuCode.c -o cpuCode.o
nvcc -c cudaCode.cu -o cudaCode.o
gcc cudaCode.o cpuCode.o -o progExe


Note: This is just an illustration. These three lines won't work by themselves; you also need to add include paths, libraries, etc.
