That is why test cycles exist: to make sure things work correctly and that the computations are right, rather than trusting them blindly.
Every output can be easily verified: simply check whether a reported DP (distinguished point) is valid using a known-good CPU routine. And that check can be part of the running program, not just part of a test suite.
If the GPU spits out some point P as a DP, the check is really easy: compute base point + distance * G on the CPU and assert that the result matches P and is indeed a valid DP. What more would you want as a guarantee?
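To make that concrete, here is a minimal CPU-side sketch of such a check. It is not my actual code, just an illustration under stated assumptions: the curve is taken to be secp256k1, a DP is defined as a point whose x-coordinate has its low dp_bits bits equal to zero (a real kernel may use a different mask), and the names point_add, scalar_mul, is_distinguished and verify_dp are made up for this example. It needs Python 3.8+ for the modular inverse via pow.

```python
# Assumed curve: secp256k1 (standard field prime, group order, generator).
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)


def point_add(a, b):
    """Add two affine points; None stands for the point at infinity."""
    if a is None:
        return b
    if b is None:
        return a
    x1, y1 = a
    x2, y2 = b
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # a + (-a) = infinity
    if a == b:
        lam = 3 * x1 * x1 * pow(2 * y1 % P, -1, P) % P   # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P  # chord slope
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)


def scalar_mul(k, pt):
    """Plain double-and-add; slow, but only used to check GPU output."""
    result = None
    k %= N
    while k:
        if k & 1:
            result = point_add(result, pt)
        pt = point_add(pt, pt)
        k >>= 1
    return result


def is_distinguished(x, dp_bits):
    """Assumed DP rule: the low dp_bits bits of x are all zero."""
    return x & ((1 << dp_bits) - 1) == 0


def verify_dp(base, distance, reported_x, dp_bits):
    """Recompute base + distance*G and compare with what the GPU reported."""
    expected = point_add(base, scalar_mul(distance, G))
    return (expected is not None
            and expected[0] == reported_x
            and is_distinguished(reported_x, dp_bits))
```

Calling verify_dp(base, distance, x, dp_bits) on every DP the GPU emits (or on a random sample, if the DP rate is high) costs one scalar multiplication per check on the CPU, which is nothing compared to the work that produced the DP.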
All that matters is that there is a steady DP output. And I think people are way too deep into the paradigm of how the problem was handled by JLP + clones, and forget the essentials, or think nothing new can be added or heavily improved. The 125-bit limitation is simply there because that is how he chose to output the DPs: stripped down to the maximum problem size he wanted to solve. The calculations themselves are of course done at full bit width; it can't be otherwise when dealing with the field arithmetic. But there are multiple other slowdowns, not just the 125-bit limit.
Currently: 585 million group ops/second using 35 W of power. This is around 50x faster than a single core of an i9, and at way lower power draw.
Maybe when I find some time I can publish a compiled CUDA kernel that solves some smaller ranges, so skeptics can see I'm not BS-ing at all about this, while also preserving my IP.