Draft idea: two values that differ only in one bit will be separated by Hamming distance of 1. Miner could backet hashes by Hamming weight, possibly discarding hashes with low and high weight, then generate Bloom filter for even backets and perform search for odd ones in two neighbor (even, +-1) backets.
Aaaand, props to smolen for coming up with the idea that would in fact put that task in reach of a regular GPU.
If you start with an exemplar given string of bits, make a hash table, and then every time you have a collision you evict whichever one is furthest in Hamming distance from your example string, then after a while you will collect a bunch of strings clustered very closely in Hamming space and stop updating your table very often - erasing the memory contention constraint and allowing all those parallel processors to hunt, without even needing to consult the table, for strings within a given Hamming distance of the original example. And when they find one, they light it up; the table of things clustered in Hamming space around that original example string will yield dozens, or even hundreds, of near-triplets that include the new string.
With the parallelism and memory bandwidth advantages, they'd rapidly reach a state where they're able to find near-triplets very efficiently indeed.
If you do the same trick with the 64G memory structure, you'll be able to collect a much larger Hamming radius and light it up with a much greater fraction of new inputs, but you can't parallelize the search to more than the 8 cores or whatever on your CPU and, because the searching threads don't even need to consult the table if they're looking for things within a small Hamming radius of an example, the GPU can.
So now I get to try to come up with some weird variation on the Cuckoo cycle.