I might release the Quark bins (and a miner able to run them) for a smaller amount of BTC... problem is that SGMiner is infected with the GPL, and I really don't want to release the host code in this case.
Maybe we could update the public sgminer sources with a simple multi-kernel version of Quark, then you'd sell your optimized version to use with it.
that is, unless there are fine tricks in the C part ;-)
There are indeed. You see, for every branch in Quark, I read 4 bytes from the GPU on the host side - also, I run both sides of the branch in parallel, then block using OpenCL events until both complete before continuing. The key to Quark is how you do the general structure - the current source is beyond stupid. While I'll explain in detail (even publicly) how it's done if you want, I'm loath to release working source for it. If people want it, they need to at least read the description and implement it themselves.
I would be interested in that description. Just saying.
All right. First, it's necessary to understand WHY the original Quark is so stupid - it comes down to how GPUs tend to handle branching, that is, if/else decisions. They more or less don't. When work-items that execute together in lockstep disagree on a branch, the hardware runs both sides of it and tosses away whichever result isn't needed. This means that for every branch in Quark (there are three), the stock miner executes one extra hash function that it didn't need to.
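A rough sketch of that arithmetic, in plain Python. The three branches come from the description above; the count of six rounds that always run is an assumption about Quark's layout that is not stated in this thread, so treat the totals as illustrative only.

```python
BRANCHES = 3      # from the description: Quark has three branch points
FIXED_ROUNDS = 6  # ASSUMPTION: hash rounds that run unconditionally


def hashes_per_work_item(executes_both_sides):
    """Count hash function executions for a single work-item."""
    per_branch = 2 if executes_both_sides else 1
    return FIXED_ROUNDS + BRANCHES * per_branch


stock = hashes_per_work_item(True)    # both sides of every branch are run
split = hashes_per_work_item(False)   # only the side that is needed is run
wasted = stock - split                # one extra hash per branch
```

Under that assumption the wasted work is exactly one hash per branch, which is what the rest of the description sets out to remove.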
Now, you certainly can't read back the whole output of the hash for every work-item - it'd be fuck slow - so the host needs to be able to execute ONLY the hash that needs to be done for a given branch, WITHOUT branching on the GPU. I thought about this for a good while, and here is what I came up with. The host may not be able to read and process every hash for every work-item in a reasonable timeframe, but what if I only needed to read two 32-bit integers from the GPU and act on them, regardless of how many work-items are run? With num_branches as the number of branch decisions in the algo, and branch_possibilities as the number of ways each branch can go, allocate num_branches * branch_possibilities buffers. For Quark and Animecoin, this is three branches times two possible outcomes per branch, so six buffers. Each buffer should hold one nonce-sized entry per global work-item, plus one extra entry.
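A minimal sketch of that buffer layout, in plain Python rather than real OpenCL host code. The 4-byte nonce size and the helper names are assumptions for illustration; on a real device these would be device buffers created through the OpenCL API.

```python
NUM_BRANCHES = 3          # branch decisions in Quark
BRANCH_POSSIBILITIES = 2  # each branch can go one of two ways
NONCE_BYTES = 4           # ASSUMPTION: 32-bit nonces


def branch_buffer_bytes(global_work_items):
    """One nonce slot per work-item, plus one trailing slot for the counter."""
    return (global_work_items + 1) * NONCE_BYTES


def allocate_branch_buffers(global_work_items):
    """Return buffers[branch][outcome], each a host-side stand-in buffer."""
    size = branch_buffer_bytes(global_work_items)
    # bytearray() happens to zero-fill; per the description only the last
    # entry ever needs to be zeroed, the rest may hold garbage.
    return [[bytearray(size) for _ in range(BRANCH_POSSIBILITIES)]
            for _ in range(NUM_BRANCHES)]


buffers = allocate_branch_buffers(8)
```

Each buffer is sized for the worst case, where every work-item takes the same side of the branch.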
Here's how you use them. Do the first hash in Quark, Blake-512, and then zero out the LAST INDEX of the two buffers for the outcomes of the first branch. Index here means one nonce-sized entry. Only the last entry is zeroed, for speed - the rest of each buffer can be filled with garbage, as long as that last entry is zero. The second hash in Quark is BMW-512, and its output decides whether Groestl-512 is run, or Skein is. Pass the BMW kernel the hash states, the two buffers for our first branch, and the number of global work-items (which is also the size of the branch buffers minus one; the buffer size can't be gotten from OpenCL device code). Do the hash, and the decision in Quark is whether the fourth bit of the output is set. If it is, Groestl is run, IIRC; otherwise, Skein. Now, if it's set, atomically read and then increment the last index of the first branch buffer, and store the nonce at the index you read. That last entry is a counter, and it also decides where each nonce gets stored - this is why we zeroed it. If it's NOT set, do the same for the other buffer. Now, in your host code, you simply read out both counters, each one being the number of work-items that went to that possibility, and you launch each hash's kernel with the appropriate branch nonce buffer, using that count as the number of global work-items. Inside this kernel, the global ID is then used as an index into the branch nonce buffer - it'll pull out each and every nonce that is appropriate for that branch. Since the original global ID serves not only as the nonce but also as the index into the hash states, the number you pulled out of the nonce buffer can be used to index the hash states and fetch and store the state for that work.
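The flow above can be simulated on the CPU to check the bookkeeping. This is only a sketch in plain Python, not the author's code: the function names are hypothetical, each "kernel" is a loop over work-items, the BMW output is reduced to one integer per work-item, and the read-and-increment that must be atomic on the GPU is an ordinary sequential update here.

```python
def bmw_decision_work_item(gid, bmw_word, groestl_buf, skein_buf, work_items):
    """Stand-in for one work-item of the BMW kernel's decision step."""
    # Fourth bit of the output set -> Groestl, otherwise Skein.
    target = groestl_buf if bmw_word[gid] & 0x8 else skein_buf
    # On the GPU this read-then-increment must be one atomic operation.
    slot = target[work_items]
    target[work_items] = slot + 1
    target[slot] = gid  # the global ID doubles as the nonce


def branch_work_item(gid, nonce_buf, ran, name):
    """Stand-in for one work-item of a Groestl or Skein kernel."""
    nonce = nonce_buf[gid]  # global ID indexes the branch nonce buffer
    ran[nonce] = name       # the nonce indexes the hash states


def run_first_branch(bmw_word):
    work_items = len(bmw_word)
    garbage = -1  # the buffers may hold anything except in the last entry
    groestl_buf = [garbage] * (work_items + 1)
    skein_buf = [garbage] * (work_items + 1)
    groestl_buf[work_items] = 0  # zero only the counter slot
    skein_buf[work_items] = 0
    for gid in range(work_items):
        bmw_decision_work_item(gid, bmw_word, groestl_buf, skein_buf, work_items)
    # Host side: read back the two counters and size each launch by them.
    groestl_count = groestl_buf[work_items]
    skein_count = skein_buf[work_items]
    ran = [None] * work_items
    for gid in range(groestl_count):
        branch_work_item(gid, groestl_buf, ran, "groestl")
    for gid in range(skein_count):
        branch_work_item(gid, skein_buf, ran, "skein")
    return groestl_count, skein_count, ran


counts = run_first_branch([0x8, 0x0, 0xF, 0x7, 0x18])
```

Every work-item ends up hashed by exactly one of the two kernels, and neither kernel contains an if/else on the hash output. On a real GPU the order of nonces inside each buffer depends on how the atomics interleave, which does not matter because each nonce carries its own index into the hash states.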
Final note for optimization - I lied. You actually need to read only ONE nonce-sized entry from a branch buffer; I said two because it makes the process easier to understand. Basic algebra - since we already know the global work-item count, every work-item branches SOME way, and they can only go two ways, the two branch counts always add up to the work-item count, i.e. GlobalWorkItems - Branch1LeftCount = Branch1RightCount. So... read only one of the two counters (doesn't matter which), and subtract it from the number of global work-items to get the other.
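As a tiny sketch of that shortcut (plain Python, hypothetical helper name), the host reads one counter and derives the other:

```python
def other_branch_count(global_work_items, counted):
    """Derive the second branch's work-item count from the first."""
    if not 0 <= counted <= global_work_items:
        raise ValueError("counter is outside the possible range")
    return global_work_items - counted


left = 400                               # the one counter read from the GPU
right = other_branch_count(1024, left)   # no second readback needed
```

The range check is an addition of this sketch, not part of the description; it is cheap and would catch a counter slot that was never zeroed.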
So, that's my technique for the overall structure of Quark. Any questions?
Pallas, we're not so different. While you believe in open source and I sometimes do not, I still believe in sharing knowledge, even to my detriment. If someone's reasonably intelligent and willing to work at it, I'll never refuse to help them learn.
Hmmm. Well, that definitely changes how I think about GPU processes. I did not realize they execute both branches. Also, thanks for the detailed description. I greatly appreciate it.