v4.0 [Mar 29, 2011]
Share GPUs across multiple threads
Use all GPUs in the system concurrently from a single host thread
No-copy pinning of existing system memory with cudaHostRegister(), a faster alternative to cudaMallocHost()
C++ new/delete and support for virtual functions
Support for inline PTX assembly
Thrust library of templated performance primitives such as sort, reduce, etc.
NVIDIA Performance Primitives (NPP) library for image/video processing
Layered Textures for working with sets of same-size, same-format textures at larger sizes and higher performance
Unified Virtual Addressing
GPUDirect v2.0 support for Peer-to-Peer Communication
Automated Performance Analysis in Visual Profiler
C++ debugging in CUDA-GDB for Linux and MacOS
GPU binary disassembler for Fermi architecture (cuobjdump)
Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features