I have problems with rigs that only have 4 GB of system memory and mixed cards (AMD/NVIDIA). So the DAG file should be loaded in chunks to avoid spilling into virtual memory.
Each instance of ethminer keeps a copy of the full DAG in system RAM to validate a solution. However, if that copy is not present, it falls back to a slightly slower "light" validation (https://github.com/Genoil/cpp-ethereum/blob/master/libethcore/EthashAux.cpp line 267-276). So if you were to load the DAG in chunks, send them over to the GPU, and then release the system-RAM copy, it could work at a slight performance cost. That is not at all what I did in the opencl-chunks branch, however: the only thing I did there was allocate the DAG in GPU RAM and send it over in chunks.
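Just to illustrate the idea (this is not ethminer's actual code — the function name, file-based DAG source, and chunk size are all made up for the sketch), streaming the DAG to the GPU chunk by chunk so the host never holds more than one chunk could look roughly like this:

```cpp
// Hypothetical sketch: stream the DAG from a file to the GPU in fixed-size
// chunks, so system RAM only ever holds one chunk instead of the full DAG.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

cl_mem streamDagToGpu(cl_context ctx, cl_command_queue queue,
                      const char* dagPath, size_t dagBytes)
{
    cl_int err;
    // One contiguous device allocation; only the host-side staging is chunked.
    cl_mem dag = clCreateBuffer(ctx, CL_MEM_READ_ONLY, dagBytes, nullptr, &err);
    if (err != CL_SUCCESS)
        return nullptr;

    const size_t CHUNK = 32u << 20;      // 32 MiB staging buffer (arbitrary)
    std::vector<char> staging(CHUNK);
    FILE* f = std::fopen(dagPath, "rb");
    if (!f) { clReleaseMemObject(dag); return nullptr; }

    for (size_t off = 0; off < dagBytes; off += CHUNK)
    {
        size_t len = (dagBytes - off < CHUNK) ? dagBytes - off : CHUNK;
        // Blocking write: the staging buffer can be reused as soon as it returns.
        if (std::fread(staging.data(), 1, len, f) != len ||
            clEnqueueWriteBuffer(queue, dag, CL_TRUE, off, len,
                                 staging.data(), 0, nullptr, nullptr) != CL_SUCCESS)
        {
            std::fclose(f);
            clReleaseMemObject(dag);
            return nullptr;
        }
    }
    std::fclose(f);
    return dag; // full DAG now lives only in GPU RAM; light validation on CPU
}
```

The trade-off is exactly the one described above: without the full DAG in system RAM, solution checks have to take the slower light-validation path.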
My 1.0.3 release adds an option to load the DAG from a different file location, so you can do mixed CUDA/OpenCL mining from two instances. However, this requires double the DAG RAM, so it's far from ideal. It would be much better if ethminer could auto-select the right GPGPU platform per worker thread, so you could do mixed CPU/OpenCL/CUDA mining from a single ethminer instance. That shouldn't be too much of a hassle to get working. I'll see if I can get it into 1.0.5 (1.0.4 is for stratum support, which I'm currently fighting with).
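Roughly what I mean by per-worker platform selection (a sketch only — none of these names exist in ethminer):

```cpp
// Hypothetical sketch: one instance plans a backend per worker thread,
// e.g. CUDA for the NVIDIA cards, OpenCL for the AMD cards, optional CPU.
#include <thread>
#include <vector>

enum class Backend { CPU, OpenCL, CUDA };

struct WorkerSpec { Backend backend; int deviceIndex; };

std::vector<WorkerSpec> planWorkers(int cudaDevices, int openclDevices, bool useCpu)
{
    std::vector<WorkerSpec> plan;
    for (int i = 0; i < cudaDevices; ++i)   plan.push_back({Backend::CUDA, i});
    for (int i = 0; i < openclDevices; ++i) plan.push_back({Backend::OpenCL, i});
    if (useCpu)                             plan.push_back({Backend::CPU, 0});
    return plan;
}

void runWorker(WorkerSpec spec)
{
    switch (spec.backend)
    {
    case Backend::CUDA:   /* init a CUDA miner on spec.deviceIndex */   break;
    case Backend::OpenCL: /* init an OpenCL miner on spec.deviceIndex */ break;
    case Backend::CPU:    /* init a CPU miner */                         break;
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (auto const& spec : planWorkers(1, 1, false)) // e.g. 1 NVIDIA + 1 AMD card
        threads.emplace_back(runWorker, spec);
    for (auto& t : threads)
        t.join();
}
```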
You obviously know way more about this than I do, but last time I checked that option, it said it was being ignored for GPU mining:
Creating 128 chunked buffers for the DAG
Loading single big chunk kernels because GPU doesn't care about chunks!
Yes, I wrote that. The point is that the original chunks implementation by ETH:DEV (which is disabled in the source, by the way) used a specific kernel that passed four separate pointers, one per DAG chunk, into the kernel. My implementation did load the DAG in chunks, but assumed the chunks would be allocated contiguously in GPU RAM, and therefore passed just a single pointer to the beginning of the DAG into the kernel. I found this approach somewhere on AMD's OpenCL forums. It turned out to work fine on NVIDIA's OpenCL implementation, but not on AMD's. I think there was some kind of fragmentation going on there, but as I don't own any AMD hardware, I dropped the idea of solving anything that way.

The original goal of the chunks approach was to avoid the DAGpocalypse, but it now seems that the DAG allocation problems some people are having are not with GPU RAM but with the amount of system RAM on Windows (see here for instance: http://forum.ethereum.org/discussion/comment/18222/#Comment_18222).
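For illustration only (these are not the actual ethash kernels, just the shape of the signatures), the difference between the two approaches boils down to this:

```c
// ETH:DEV-style chunked kernel: four separate pointers, one per DAG chunk,
// so the chunks can live anywhere in GPU memory.
__kernel void search_chunked(__global const uint* dag0,
                             __global const uint* dag1,
                             __global const uint* dag2,
                             __global const uint* dag3,
                             __global volatile uint* output)
{
    // index into the right chunk: chunk = i / chunkWords, offset = i % chunkWords
}

// Single-pointer kernel: assumes the chunked allocations ended up contiguous
// in GPU RAM, so one base pointer covers the whole DAG. This assumption held
// on NVIDIA's OpenCL implementation but apparently not on AMD's.
__kernel void search_contiguous(__global const uint* dag,
                                __global volatile uint* output)
{
    // plain linear indexing: dag[i]
}
```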