Early CUDA programs had to conform to a flat, bulk parallel programming model. Programs had to perform a sequence of kernel launches, and for best performance each kernel had to expose enough parallelism to efficiently use the GPU. For applications consisting of "parallel for" loops the bulk parallel model is not too limiting, but some parallel patterns, such as nested parallelism, cannot be…
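As a minimal sketch of this flat model, the host below issues a sequence of kernel launches, each a "parallel for" that assigns one thread per element so the kernel exposes enough parallelism to fill the GPU. The kernel names and sizes are illustrative, not from the original text:

```cuda
#include <cstdio>

// One "parallel for" per kernel: each thread handles a single element.
__global__ void addOffset(float *x, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b;
}

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;            // 1M elements: plenty of parallelism
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;

    // The flat, bulk parallel model: the host drives a sequence of
    // kernel launches; all parallelism is exposed at launch time.
    addOffset<<<blocks, threads>>>(x, 1.0f, n);
    scale<<<blocks, threads>>>(x, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```

Note that any decision to launch more work (for example, refining a region of the data) must be made back on the host between launches; the kernels themselves cannot spawn child grids in this model.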