I assume you are confused because you believe that hardware normally checks all nonces all the time.
Dude! I have to assume that you've again gone into the mode where you can't read with comprehension and have started arguing with the voices in your head. I normally agree with over 90% of what you say, except when you start ranting, like when you claimed that Apple Computer doesn't take money orders. Please chill out or maybe switch to decaf?
Obviously this test has to be hardware-specific, because e.g. Bitfuries only test 768/1024 of the available nonce space. It would have to be even more complex to account for failed sub-engines in multi-engine chips.
Anyway, I think that the problem is solvable. Maybe with a Bloom filter of "hashing blind spots". It would certainly require cooperation between the ASIC vendors and the developers of the miner and pool software. As with most things in Bitcoin mining, it will be some sort of probabilistic solution, not a mathematical proof of failure.
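To make the Bloom filter idea a bit more concrete, here is a minimal sketch. Everything in it is an assumption made up for illustration: the `BloomFilter` class, the idea of keying the filter by 2^26-nonce chunks, and the `may_be_blind` helper. No existing miner or pool software does this. A Bloom filter never gives a false negative, so a reported blind spot is always recognized, but it can give false positives, which fits the "probabilistic, not a proof" nature of the whole thing.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; the bit array is stored in a single Python int."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key: bytes) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Hypothetical granularity: one filter entry per chunk of 2**26 nonces.
CHUNK_BITS = 26


def chunk_key(nonce: int) -> bytes:
    """Map a nonce to the key of the chunk that contains it."""
    return (nonce >> CHUNK_BITS).to_bytes(4, "big")


# The miner reports the chunks it never hashes (hypothetical chunk numbers).
blind_spots = BloomFilter()
for dead_chunk in (5, 17, 40, 63):
    blind_spots.add(dead_chunk.to_bytes(4, "big"))


def may_be_blind(nonce: int) -> bool:
    """True if the nonce falls in a reported blind spot (or is a false positive)."""
    return chunk_key(nonce) in blind_spots
```

A pool could then discount a missed test solution whenever `may_be_blind(nonce)` is true, instead of counting it as a strike against the miner.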
The first person posting on this forum about false negatives in mining was hardcore-fs with his XUPV5 FPGA miner, but I don't think that he published his results or code.
Likewise, Spondoolies promised that their ASIC designer would start posting on the forum after Passover, but apparently they haven't had time to do this yet.
In general, I'm optimistic that the problem will be solved. Nowadays, who remembers when long-polls were the major problem for mining pools? From my experience in the electronics and software industry: one person's problem is another person's opportunity. For example: NTSC started as "Never Twice the Same Color" and ended up where even the cheapest TV receiver chipsets were precisely calibrating themselves 29.97 times per second (using Vertical Interval Reference and Ghost Canceling Reference). And that was done quite effectively over all-analog broadcast retransmission networks, including satellite links, where the end-users were completely oblivious to the problems in the transmission channel. All it took was to build a reasonable model of the distortion and run the tests frequently enough.
Quoting below just for future reference:
You are correct that there isn't any observable difference between the 2 types of problems. And in practice, I think it's extremely unlikely that someone with hardware that has catastrophic levels of false negative errors is unaware of the issue, so the intent is likely equivalent as well.
When you have people of very questionable competence designing silicon on crash schedules, it's inevitable that serious hardware defects are going to be in play. And with the money involved, if there is a way to externalize the cost of failure to pool participants, people are going to choose that over eating the loss themselves.
I think it's an absolute must that stratum allow pools to send test jobs to validate hardware integrity in a blind way that allows detection of malicious withholding as well. Without that, I expect the days of public pools are numbered, and with them all hope of even modest decentralization will go as well.
It has nothing to do with hardware errors (although they are another minor source of false positives).
Simple version:
If nonce 728937289 solves the block for a given unit of work and a particular piece of hardware, for a variety of reasons, doesn't check nonce 728937289, it is not going to find a solution.
If a user doesn't return a solution, is it because he is cheating or because his hardware didn't check that nonce with that work? The answer is that there is absolutely no way to know.
It isn't a hardware fault or error.
If nonce 728937289 solves the block for a given unit of work and a particular piece of hardware, for a variety of reasons, doesn't check nonce 728937289, it is not going to find a solution. It is that simple. I assume you are confused because you believe that hardware normally checks all nonces all the time. Nothing could be further from the truth. There are dozens of reasons why a nonce wouldn't be checked, and no software is going to report that as an error. HW errors as reported by cgminer and the like are the result of the hardware reporting that work A + nonce B produces a hash of a particular difficulty when it doesn't.
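For reference, the check behind a "HW error" can be sketched like this: the host re-hashes the 80-byte block header with the nonce the hardware claimed and compares the double SHA-256 result against the target. This is only a minimal sketch, not cgminer's actual code; `header_hash` and `check_claim` are names made up for illustration, and the Bitcoin genesis block header is used as known-good sample data.

```python
import hashlib
import struct


def header_hash(header76: bytes, nonce: int) -> int:
    """Double SHA-256 of the 80-byte header, read as a little-endian integer."""
    header = header76 + struct.pack("<I", nonce)
    digest = hashlib.sha256(hashlib.sha256(header).digest()).digest()
    return int.from_bytes(digest, "little")


def check_claim(header76: bytes, nonce: int, target: int) -> bool:
    """True if the claimed nonce really meets the target; False is a 'HW error'."""
    return header_hash(header76, nonce) <= target


# Bitcoin genesis block header fields (everything except the nonce).
GENESIS_MERKLE_ROOT = bytes.fromhex(
    "4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b"
)[::-1]  # displayed hashes are byte-reversed relative to the header encoding
GENESIS_HEADER76 = (
    struct.pack("<I", 1)                        # version
    + bytes(32)                                 # previous block hash (all zero)
    + GENESIS_MERKLE_ROOT
    + struct.pack("<II", 1231006505, 0x1D00FFFF)  # timestamp, compact target
)
GENESIS_NONCE = 2083236893
GENESIS_TARGET = 0xFFFF << (8 * (0x1D - 3))     # expanded from bits 0x1d00ffff

claim_ok = check_claim(GENESIS_HEADER76, GENESIS_NONCE, GENESIS_TARGET)
```

Note that this check only runs when the hardware returns a nonce. A nonce that was never hashed produces no claim at all, so nothing gets verified and nothing gets counted as an error.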
I think that the Stratum mining protocol can be extended to allow the client to send the pool server maps of the "blind spots" in the nonce space.
It isn't a static range. Say you have an ASIC with 64 cores; the designer may decide to take the nonce range (2^32) and break it into 64 chunks. It does this by assigning an offset to each core. So all the cores get the same work, each one starts at a different nonce value, and each one works through its own 2^26 nonces. However, due to yields, not all chips will have all 64 cores operating. The seller may have designed it around 60 cores being "good enough" to achieve the hashrate. So if 4 cores are "off", that means millions of nonces will never even be attempted. Most ASICs also work on dynamic load, so as the hardware error rate and/or temperature rises, the chip shuts off cores. So the same nonces won't be covered all the time.
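The 64-core example can be put into a few lines of Python. This is a sketch of the layout described above (contiguous chunks of 2^26 nonces, one per core), not of any particular chip; the set of disabled cores is made up for illustration.

```python
NONCE_SPACE = 2**32
NUM_CORES = 64
CHUNK = NONCE_SPACE // NUM_CORES  # 2**26 nonces per core


def core_for_nonce(nonce: int) -> int:
    """Index of the core whose chunk contains this nonce."""
    return nonce // CHUNK


def is_checked(nonce: int, disabled_cores: set) -> bool:
    """True if some working core will actually hash this nonce."""
    return core_for_nonce(nonce) not in disabled_cores


def unchecked_count(disabled_cores: set) -> int:
    """How many nonces are never attempted for a given set of dead cores."""
    return len(disabled_cores) * CHUNK


# Hypothetical chip with 4 of its 64 cores switched off.
disabled = {3, 10, 42, 60}

never_tried = unchecked_count(disabled)          # 4 * 2**26 nonces
fraction_missed = never_tried / NONCE_SPACE      # 4 / 64 of the nonce space
winning_nonce_seen = is_checked(728937289, disabled)
```

With these made-up numbers, nonce 728937289 belongs to core 10, which is off, so this chip would never find that solution and would not report any error either. If the chip later switches a different core off because of temperature, the set of missed nonces changes with it.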
Still, it goes way beyond just which nonces are checked. Both the ASICs and the mining software queue work. So what happens when a miner has 30 seconds' worth of work queued up and you send him the "test work"? Are you going to hold the block for 30 seconds, which would mean a >7% orphan rate? What happens if, due to latency, the miner returns the proper solution but only after you have broadcast the block? Is he cheating, or just slow, or did he just have a lot queued up? How are you going to avoid the attacker just sharing info between workers? Say you send the test work to 10% of workers and wait 30 seconds. You will probably end up with hundreds, maybe thousands, of false positives AND a 7%+ orphan rate (on top of whatever you are losing to attacks). If the attacker had, say, 30 accounts, there is a less than 3% chance that one and only one would be given the test work and end up withholding it. So 3% of the time you will catch the worker; due to false positives, let's say you don't boot someone until they fail 3 times. That means on average the attacker will pass 99 blocks before failing out. However, you are only testing him 10% of the time, so you MIGHT (I doubt it) catch him after he withholds or attempts to withhold 990 blocks.
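The back-of-the-envelope numbers at the end of that paragraph can be reproduced roughly as follows. This is only a sketch: it takes the 3% catch chance, the 3-strike rule and the 10% test rate from the post as given, and it assumes each test is an independent trial, which the post does not state. The function names are made up for illustration.

```python
from fractions import Fraction


def expected_tests_until_booted(strikes: int, catch_prob: Fraction) -> Fraction:
    """Mean number of tests needed to accumulate `strikes` failures.

    With independent tests this is the mean of a negative binomial
    distribution: strikes / catch_prob.
    """
    return strikes / catch_prob


def expected_blocks_until_booted(
    strikes: int, catch_prob: Fraction, test_rate: Fraction
) -> Fraction:
    """Mean number of blocks that pass before the attacker is booted,
    when only a `test_rate` fraction of blocks is a test."""
    return expected_tests_until_booted(strikes, catch_prob) / test_rate


CATCH_PROB = Fraction(3, 100)   # figure taken from the post
TEST_RATE = Fraction(1, 10)     # figure taken from the post
STRIKES = 3                     # figure taken from the post

tests_needed = expected_tests_until_booted(STRIKES, CATCH_PROB)
blocks_needed = expected_blocks_until_booted(STRIKES, CATCH_PROB, TEST_RATE)
```

Under these assumptions the averages come out at about 100 tests and about 1000 blocks, the same order of magnitude as the 99 and 990 quoted above.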
Of course, even your legit users are going to start working against you (or just flee to where they aren't attacked by the pool). Those who want to stay will probably design a work relay system so they can identify your test works, if for no other reason than to avoid being falsely kicked out (especially if you confiscate the work completed). The attacker could join those relay networks and all but guarantee he would pass all tests.
Also, don't take this as exhaustive; I just don't feel like covering every possible scenario. These issues are just the tip of the iceberg.
The only way to prevent these types of attacks is for the pool to know something the miner doesn't. The current block hashing protocol doesn't make that possible.