Tuesday, May 5, 2009

CUDA and double precision floating point numbers

If you are in any way using CUDA to computations involving doubles, you need to perform a few adjustments to have it work. By default CUDA has double support disabled, this entails that all doubles are silently converted into floats inside kernels and any double precision calculations computed are incorrect.

To enable the use of doubles inside CUDA kernels you first need to make sure you have a CUDA Compute 1.3-capable card. These are the newer versions of the nVidia CUDA cards such as the GTX 260, GTX 280, Quadro FX 5800, and Tesla S1070 and C1060.
Thereby you have to add a command line options to the nvcc compiler: --gpu-architecture sm_13 .

Note: you can use sm_11 and sm_12 to enable functionality added in CUDA 1.1 and CUDA 1.2 respectively.

Source: https://www.cs.virginia.edu/~csadmin/wiki/index.php/CUDA_Support/Enabling_double-precision and http://forums.nvidia.com/index.php?showtopic=84999&pid=481809&mode=threaded&start=#entry481809

Sunday, May 3, 2009

Robots.txt blocking blogger blogs?

Google Wemaster Tools reports for example: http://www.herikstad.net/search/label/clean%20cron is blocked by robots.txt

No worries however, according to the FAQ and http://www.google.com/support/forum/p/Webmasters/thread?tid=33c2e6597951a702&hl=en, this is just to remove results from the blog search tool included on the page (the URL contains "search" directory). Including the results would give duplicate sites in Google search and give incorrect visiting stats etc.

If you don't have "search" in your URL error, check out the link above.

Friday, May 1, 2009

CUDA and SSE2 intrinsics

Using SSE2 intrinsic calls may speed up your program execution substantially. However, nvcc seems to be unable to compile SSE2 code. For example including emmintrin.h or equivalent will give errors like this:

/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(48): error: identifier "__builtin_ia32_emms" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(61): error: identifier "__builtin_ia32_vec_init_v2si" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(90): error: identifier "__builtin_ia32_vec_ext_v2si" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(114): error: identifier "__builtin_ia32_packsswb" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(129): error: identifier "__builtin_ia32_packssdw" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(144): error: identifier "__builtin_ia32_packuswb" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(158): error: identifier "__builtin_ia32_punpckhbw" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(172): error: identifier "__builtin_ia32_punpckhwd" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.1.2/include/mmintrin.h(186): error: identifier "__builtin_ia32_punpckhdq" is undefined
Error limit reached.
100 errors detected in the compilation of "/tmp/tmpxft_000010b9_00000000-4_template.cpp1.ii".
Compilation terminated.
make: *** [obj/release/template.cu_o] error 255

To come around this problem, you need to compile the code using SSE2 using gcc and your CUDA code using nvcc and then link them together afterwards.

So create a separate .c and .h file where you create a function that execute the SSE2 intrinsic calls. Include the emmintrin.h file in the .c file, since doing so in the .h will get you the same result as above because nvcc will read the .h file.

To use the SSE2 intrinsic function from your .cu file, you need to include the new .h, but in extern brackets like this:

extern "C" {
#include "yourfile.h"

Finally, you need to compile the files separately using nvcc and gcc and then link them together:

gcc cpuCode.c -o cpuCode.o
nvcc cudaCode.cu -o cudaCode.o
gcc cudaCode.o cpuCode.o -o progExe

Note: This is just an illustration, these 3 lines won't work by themselves, you need to include libaries etc.

CUDA "unspecified launch failure"

The error "unspecified launch failure" usually means the same as "segment fault" for host code. Check that your code does not try to access any areas outside the arrays being used. A common mistake is using 'the whole idx' instead of just the thread id to access shared memory. Here's an example:

int idx = blockIdx.x * blockDim.x + threadIdx.x;
shared[idx] = input[idx];

will give you an error and should look like this:

int idx = blockIdx.x * blockDim.x + threadIdx.x;
int tid = threadIdx.x;
shared[tid] = input[idx];

CUDA kernel errors

To print any errors that may be returned when you execute a kernel in a human readable way (you get the errors defined in the programming guide), add the following lines after your kernel call:

err = cudaGetLastError();
if (err != cudaSuccess) printf("%s\n", cudaGetErrorString( err ) );