Topic: [ANN][YAC] YACoin ongoing development - page 173. (Read 379868 times)

newbie
Activity: 14
Merit: 0
If I had to guess he read the keccak whitepaper which talks about optimizing the code, there's apparently some instructions that are not necessary or something. I'm also not taking advantage of the vector types. The core ChaCha code as an example, takes 16 uint's and could be done using either a uint4 [4] or uint8 [2] or uint16 -- but, as I mentioned, I've got no clue if that would actually make it faster.
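As a rough illustration of the uint4 [4] layout mentioned above, here is a minimal Python sketch (not OpenCL, and not taken from scrypt-jane or any miner). It runs one ChaCha double round two ways: on 16 scalar words, and on four 4-word rows the way a uint4-style kernel might. All function names are made up for the example, and it says nothing about which form is faster on a given GPU; it only shows that the two layouts compute the same thing.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def quarter_round(a, b, c, d):
    """Standard ChaCha quarter round on four 32-bit words."""
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d

def double_round_scalar(x):
    """Column round then diagonal round, indexing 16 separate words."""
    x = list(x)
    order = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
             (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]
    for i, j, k, l in order:
        x[i], x[j], x[k], x[l] = quarter_round(x[i], x[j], x[k], x[l])
    return x

def qr_rows(a, b, c, d):
    """Quarter round applied lane-wise to four 4-word rows (uint4 style)."""
    lanes = [quarter_round(a[i], b[i], c[i], d[i]) for i in range(4)]
    return tuple([lane[k] for lane in lanes] for k in range(4))

def rot_lanes(v, n):
    """Rotate the lanes of a 4-word row left by n positions."""
    return v[n:] + v[:n]

def double_round_rows(x):
    """Same double round, treating the state as four rows of four words."""
    a, b, c, d = x[0:4], x[4:8], x[8:12], x[12:16]
    a, b, c, d = qr_rows(a, b, c, d)                          # column round
    b, c, d = rot_lanes(b, 1), rot_lanes(c, 2), rot_lanes(d, 3)
    a, b, c, d = qr_rows(a, b, c, d)                          # diagonal round
    b, c, d = rot_lanes(b, 3), rot_lanes(c, 2), rot_lanes(d, 1)
    return a + b + c + d

# Arbitrary test state, just to compare the two layouts.
state = [(i * 0x9E3779B9 + 1) & MASK32 for i in range(16)]
print(double_round_scalar(state) == double_round_rows(state))
```

In the row form, the diagonal round becomes a lane rotation plus the same lane-wise quarter round, which is the part that maps naturally onto vector types.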

After reading the PBKDF2 specs and pseudocode, I also think there are too many steps in the scrypt-jane code -- maybe I'm just reading it wrong -- that are not necessary for what we're doing.

Also, to figure out endianness, query the "CL_DEVICE_ENDIAN_LITTLE" property through OpenCL; that way you'll know for sure.
sr. member
Activity: 425
Merit: 262
The Radeon HD's are little-endian; the easiest way to check is to query the device properties via OpenCL.

Also, you can make a coin GPU-resistant by requiring a lot of memory to get good speed. The thing about GPUs is that they're highly parallel but slow at serial processing. So if you can modify scrypt to remove the TMTO and force all N precalculated entries to be required, it would reduce their effectiveness. Go even further and make N *and* r increase over time.

I searched on Google, and there are some pages saying "Radeon GPUs are big-endian" ...
sr. member
Activity: 347
Merit: 250
So, I'll chime in, I wrote one too and get pretty abysmal rates (just got around to testing it -- 150 kH/s or so, under N=6) using a 7950. I know my implementation can be cleaned up, but it would require understanding keccak better, and since I'm new to OpenCL and basic optimization information is pretty hard to find, I'm not sure I want to put in the effort. I put in a lookup gap capability but don't need to use it.

Interestingly, your hash rate is right in the same ballpark as mine for N=6 on a 6950.  Guess that makes 2 of us in the catastrophically unoptimized OpenCL implementation club..  Probably won't surprise anyone that I'm staring at my OpenCL code right now trying to figure out what I did wrong, as I'd consider mtrlt to be a reliable source (as the first person to publicly write an OpenCL scrypt implementation, and just about everyone mining Litecoin on GPU's is using his OpenCL code).
sr. member
Activity: 425
Merit: 262

This confirms the criticism that N started out too low. My calculations show N/KB increases at/around the following dates:
5/21: 256, 32KB
5/30: 512, 64KB
6/2: 1024, 128KB
6/26: 2048, 256KB
7/8: 4096, 512KB
8/14: 8192, 1024KB

How do you feel your YAC GPU kernel performance will hold up under those adjustments (in absolute terms and relative to a high-end CPU)?

I did check out Tacotime's MC2 paper, I like the approach he takes with varying the hash algorithm to achieve maximum ASIC/FPGA resistance. Unfortunately, building GPU resistance for any good length of time looks like a much harder (impossible?) task.


I don't think designing for resistance to any particular technology is a good approach. The key points are the strength and security of the network, and a fair reward for keeping the network running efficiently.
newbie
Activity: 14
Merit: 0
The Radeon HD's are little-endian; the easiest way to check is to query the device properties via OpenCL.

Also, you can make a coin GPU-resistant by requiring a lot of memory to get good speed. The thing about GPUs is that they're highly parallel but slow at serial processing. So if you can modify scrypt to remove the TMTO and force all N precalculated entries to be required, it would reduce their effectiveness. Go even further and make N *and* r increase over time.
sr. member
Activity: 425
Merit: 262
I tried last weekend to port the scrypt-jane code to OpenCL, but I only achieved 60 kH/s with a 7950, and besides, the hashes were not valid. So I gave up, because at that point I didn't think it was worthwhile to keep going.

I think the main problem is that the Radeon GPU uses big-endian; I might not be handling it correctly at some point.

I used cgminer, but I have very little knowledge of its implementation, and I'm also new to OpenCL and its system calls (too many functions ... that's why I don't like to use them).

The way I debugged it was to compile the CL kernel source (slightly modified) with gcc directly. It actually produces the same hash as the original one. But the endian problem still exists ...
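One way to narrow down a suspected byte-order problem like this is to look at how the same 32-bit words serialize under each convention. The following is only a small host-side Python sketch with made-up example words and helper names; it does not query the GPU (the OpenCL device property mentioned elsewhere in the thread is the way to do that), it just shows what a per-word byte swap does to the data going into or out of a hash.

```python
import struct
import sys

def words_to_bytes(words, byte_order):
    """Serialize 32-bit words as 'little' or 'big' endian bytes."""
    prefix = "<" if byte_order == "little" else ">"
    return struct.pack(prefix + "%dI" % len(words), *words)

def bswap32(word):
    """Reverse the four bytes of a 32-bit word."""
    return struct.unpack("<I", struct.pack(">I", word))[0]

# Example words only; real input would be the block header words.
words = [0x01020304, 0xA0B0C0D0]

little = words_to_bytes(words, "little")
big = words_to_bytes(words, "big")

print("host byte order:", sys.byteorder)
print("little-endian bytes:", little.hex())
print("big-endian bytes:   ", big.hex())

# Swapping every word and then writing little-endian gives the big-endian bytes,
# so a single missing or extra swap flips between the two layouts.
swapped = words_to_bytes([bswap32(w) for w in words], "little")
print("swap + little == big:", swapped == big)
```

If a kernel produces the right hash when built on the host but not on the device, comparing intermediate words against both layouts like this can show whether a swap is missing, or whether the bug is somewhere else entirely.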
sr. member
Activity: 347
Merit: 250
I did check out Tacotime's MC2 paper, I like the approach he takes with varying the hash algorithm to achieve maximum ASIC/FPGA resistance. Unfortunately, building GPU resistance for any good length of time looks like a much harder (impossible?) task.

This is likely true.  I think the best one can hope for is to narrow the performance gap between CPU's and GPU's by making the memory usage large enough that it gets pushed out to significant amounts of external RAM.  At that point, external RAM bandwidth is the deciding factor.  We're already seeing GPU's with wider external memory busses than many CPU's however.

At best it's an arms race.  It appears that with YAC, the lag time for people to implement it on OpenCL still gave early CPU miners a significant head-start (I just wish I was one of those that actually got the head-start).
newbie
Activity: 14
Merit: 0
So, I'll chime in, I wrote one too and get pretty abysmal rates (just got around to testing it -- 150 kH/s or so, under N=6) using a 7950. I know my implementation can be cleaned up, but it would require understanding keccak better, and since I'm new to OpenCL and basic optimization information is pretty hard to find, I'm not sure I want to put in the effort. I put in a lookup gap capability but don't need to use it.

Btw, does anyone know if using the vector types in OpenCL on AMD is faster than using the scalar types? I haven't been able to find any definite statements leaning one way or the other ...

I wonder if mtrlt wrote his own keccak implementation or ported the optimized version they released. I went about it by copying over the code from scrypt-jane and porting it directly to OpenCL.

Oh and as an fyi, I didn't do it to actually make a miner, I just wanted to see how much effort it would take and how fast it would be.
sr. member
Activity: 333
Merit: 250
"Raven's Cry"
Could some kind person here tell me, with simple instructions, how to remove that warning from the YAC client, or tell me what the worst that can happen is if I don't?

ty Kiss
sr. member
Activity: 462
Merit: 250
Quote from: WindMaster
You would indeed be one of the people I'd expect to have modified your own OpenCL code for scrypt+chacha fairly early on.  Anyway, willing to post some hash rate info for your kernel at the current N=128 for a given GPU type and lookup gap (if any)?  You can probably post that info safely without giving anyone a head-start on making the modifications themselves.

The knowledge might give others incentive to do it, but oh well. Currently (N=128) it does 3.4MH/s on a core-underclocked (830->738) HD6990, with lookup_gap at 1, thus no gap. As a curiosity, at N=32, it does 7.3MH/s under the same setup.

This confirms the criticism that N started out too low. My calculations show N/KB increases at/around the following dates:
5/21: 256, 32KB
5/30: 512, 64KB
6/2: 1024, 128KB
6/26: 2048, 256KB
7/8: 4096, 512KB
8/14: 8192, 1024KB
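The KB figures in this schedule follow from the 128*N bytes-per-hash relation quoted elsewhere in the thread. A minimal Python sketch that reproduces them, assuming r=1 and no lookup gap (the function name is made up for the example, and the dates are simply the estimates listed above):

```python
def scrypt_scratchpad_bytes(n, r=1):
    """Scratchpad size for one scrypt hash: 128 * r * N bytes."""
    return 128 * r * n

# (estimated date, N) pairs copied from the list above.
schedule = [("5/21", 256), ("5/30", 512), ("6/2", 1024),
            ("6/26", 2048), ("7/8", 4096), ("8/14", 8192)]

for date, n in schedule:
    kb = scrypt_scratchpad_bytes(n) // 1024
    print(f"{date}: N={n}, {kb}KB")
```

Each doubling of N doubles the per-hash memory, which is why the right-hand column doubles at every step.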

How do you feel your YAC GPU kernel performance will hold up under those adjustments (in absolute terms and relative to a high-end CPU)?

I did check out Tacotime's MC2 paper, I like the approach he takes with varying the hash algorithm to achieve maximum ASIC/FPGA resistance. Unfortunately, building GPU resistance for any good length of time looks like a much harder (impossible?) task.
sr. member
Activity: 347
Merit: 250
Quote from: WindMaster
You would indeed be one of the people I'd expect to have modified your own OpenCL code for scrypt+chacha fairly early on.  Anyway, willing to post some hash rate info for your kernel at the current N=128 for a given GPU type and lookup gap (if any)?  You can probably post that info safely without giving anyone a head-start on making the modifications themselves.

The knowledge might give others incentive to do it, but oh well. Currently it does 3.4MH/s on a core-underclocked (830->738) HD6990, with lookup_gap at 1, thus no gap. As a curiosity, at N=32, it does 7.3MH/s under the same setup.

Oyy, my implementation was pretty shitty then.  You went 20x faster than I did at N=32, though I was on a 6950.


Quote from: WindMaster
Out of curiosity, how many hours after launch or after you started modifying your OpenCL kernel did it take you to make the changes?  Mainly a point of curiosity, for comparison with how many hours it took me.  I went from scratch rather than modifying the Reaper/cgminer kernel though, so my hour comparison will differ a bit because of that.

I was a bit late, I started coding the miner about 16h after the launch. It took me 13.5h from start, to a working implementation. It was very intensive, as you might imagine. Difficulty rising like no tomorrow, and my code only gave errors, until it finally worked.

About 8 hours here, but as you can see, my benchmark test was almost catastrophically slower than yours.  And that was just to the point of being able to get valid hashes for benchmark purposes, not to finish out an all-up miner.  Now I'd be inclined to say my implementation is flawed.


Debugging OpenCL code is horrible. :-)

+1


The knowledge might give others incentive to do it

There's still a pretty large technical knowledge barrier to entry though.  I suspect everyone with the correct skillset and OpenCL experience already went for it.  Though this may give incentive for everyone to start figuring out how best to optimize it..
sr. member
Activity: 347
Merit: 250
What happens if you leave in the code that checks whether the checkpoint is too old? Does the "checkpoint too old" warning remain? In other words, is it enough to just add a newer checkpoint to the list?

Yeah, but only for a while and then the warning will pop up again until more recent checkpoints have been added.  That would require everyone to update and rebuild the client again periodically (at least every 10 days with the way the check is written), which is probably something we'll want to get away from for long-term YAC adoption.  In fact, all the people running the official YAC client are seeing that warning right now, which probably won't be helping the public appearance of YAC (perhaps similar effect to the poorly worded warning on bter?).

My opinion is the check was only necessary during the early stage of the coin launch when the probability was high that someone could 51% it by mining off-network and reintroducing their chain later on, before the network hash rate increased enough to make it expensive to do so.
member
Activity: 104
Merit: 10
Indeed.  You passed the test to check if you know what you're talking about.  Smiley

Crafty. I like your style.

Quote from: WindMaster
You would indeed be one of the people I'd expect to have modified your own OpenCL code for scrypt+chacha fairly early on.  Anyway, willing to post some hash rate info for your kernel at the current N=128 for a given GPU type and lookup gap (if any)?  You can probably post that info safely without giving anyone a head-start on making the modifications themselves.

The knowledge might give others incentive to do it, but oh well. Currently (N=128) it does 3.4MH/s on a core-underclocked (830->738) HD6990, with lookup_gap at 1, thus no gap. As a curiosity, at N=32, it does 7.3MH/s under the same setup.

Quote from: WindMaster
Out of curiosity, how many hours after launch or after you started modifying your OpenCL kernel did it take you to make the changes?  Mainly a point of curiosity, for comparison with how many hours it took me.  I went from scratch rather than modifying the Reaper/cgminer kernel though, so my hour comparison will differ a bit because of that.

I was a bit late, I started coding the miner about 16h after the launch. It took me 13.5h from start, to a working implementation. It was very intensive, as you might imagine. Difficulty rising like no tomorrow, and my code only gave errors, until it finally worked. Debugging OpenCL code is horrible. :-)
sr. member
Activity: 347
Merit: 250
I'm getting the "checkpoint is too old" warning with WindMaster's client as well. I tried to fix it in checkpoints.cpp, but it didn't help...

I just pushed updates to my git repository to remove the "checkpoint too old" warning (no longer needed, coin is launched and stable), and added checkpoints at 30000, 45000 and 60000.  I'm showing the same block hash at 65000 that other people posted earlier today so we're all on the same blockchain.  For the person that asked why there should be more than one checkpoint rather than just the most recent checkpoint, it speeds up checking of the blockchain in the client.
legendary
Activity: 1484
Merit: 1005
I also have a working YAC kernel for my own miner, Reaper.
Shocked

Just want to know one thing: Do GPUs blow CPUs out of the water?

I told everyone...
sr. member
Activity: 347
Merit: 250
for comparison with how many hours it took me.

So, correct me if I am wrong.
It appears that you are taking over development of YAC, and have modified a miner to be able to GPU mine YACoins  while the vast majority of users are only able to CPU mine.

Read earlier in the thread, I just wrote a non-Reaper/cgminer OpenCL kernel for this hashing algorithm at N=32 and benchmarked it.  I didn't integrate it into a miner.  Instead, I moved straight to an FPGA implementation, since N=32 offered a special opportunity for it to run very quickly on an FPGA.  My GPU's are happily mining Litecoin.  My OpenCL kernel isn't at a stage that would be useful for mining.

To be clear, anyone that wants to screw with modifying Reaper or cgminer's OpenCL kernel is free to do so.  They're open source.  Download it and start hacking away.  I would imagine several people already have and succeeded, but we won't know for sure until everyone starts actually posting code.  Or who knows, maybe mtrlt is the only one who bothered?  We don't know at this point.

Anyway, it's not like I have insider knowledge or some unfair advantage here.  The source code for the scrypt-jane library used by the YAC client is open source, the source code for Reaper and cgminer is also open source, and there are even Wikipedia articles that spell out how scrypt and the salsa and chacha mixing functions work in a way that's easy to understand.  If technical skill at writing code is an unfair advantage, mtrlt has most of us beat there (me included) and has made a pretty plausible claim to have a working implementation (if anyone would've done it, mtrlt would have).  Of course, I'm not the original developer of YAC, I'm just the one stepping up to keep things rolling after the original developer went AWOL.  I didn't have any advance knowledge of what hashing algorithm would be used for YAC.  In fact, I slept through the coin launch and didn't become aware of it until 8 hours later, and didn't start mining until another 30 minutes after that, after finishing waking up and getting sufficiently caffeinated.  Time will tell if stepping up and continuing to improve YAC just turns me into the YAC lightning rod though.

I suspect people think I have far more YAC than I actually do though.  I've bought more YAC on bter than I've mined, by no small margin.  If everyone is upset they didn't mine a lot of YAC, it doesn't matter because YAC is very inexpensive to buy on bter right now.


You can probably post that info safely without giving anyone a head-start on making the modifications themselves.

Also appears you don't wish to share that knowledge..?

I didn't optimize it, so my hash rates are pretty low.  So here's the dilemma.  Let's say I post a number for my unoptimized implementation, and later the YAC-enabled Reaper is released and it turns out it runs 10x faster.  Plenty of finger-pointing will occur just as happened with ArtForz when mtrlt released the version of Reaper that mined scrypt in Litecoin and the numbers blew ArtForz's (reported) Nvidia-based hash rate numbers out of the water.  We still see controversy over that to this day.

Anyway, at N=32, I benchmarked at ~360kH/sec on a 6950 (not overclocked) without any lookup gap, while my 4-year-old IBM HS21 blade servers with 2x Xeon E5450's were cranking about 320kH/sec.  Not really far off of what mtrlt's kernel was doing on the same GPU for scrypt(1024,1,1) for Litecoin with a lookup gap of 2.  I didn't implement any other lookup gap, so I have no idea what the % speed advantage is on GPU's for taking advantage of that TMTO shortcut.  And I imagine my hash rate was poor.  As I've said, I made no effort to optimize it beyond getting something that would produce valid hashes.  Just wait for mtrlt to release actual real numbers.  If anyone optimized it well, he would've.  So, no finger-pointing if mtrlt achieved some totally ridiculous hash rate completely out of the ballpark of what I did.

That's why my GPU's are mining Litecoin, and a handful of my spare Xeon servers are mining YAC.
sr. member
Activity: 347
Merit: 250
How about adding network hashrate?

Good thought, added to my TODO list.
member
Activity: 107
Merit: 10
for comparison with how many hours it took me.

So, correct me if I am wrong.
It appears that you are taking over development of YAC, and have modified a miner to be able to GPU mine YACoins  while the vast majority of users are only able to CPU mine.

You can probably post that info safely without giving anyone a head-start on making the modifications themselves.


Also appears you don't wish to share that knowledge..?
efx
sr. member
Activity: 378
Merit: 250
512MB is overkill. A HD6970 has 1536 cores. One hash needing 512MB memory would mean a HD6970 would have to have 768 GB of memory, without a TMTO, which kills performance quite rapidly. I think the N increase should be capped way before 512MB is reached. Maybe 16MB?
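The arithmetic behind that 768 GB figure, and what N the 512MB and 16MB points would correspond to, can be checked with a short Python sketch. It assumes one hash in flight per core (as the post does), binary units, r=1, no lookup gap, and the 128*N bytes-per-hash relation; the function names are invented for the example.

```python
MIB = 1024 * 1024

def total_scratchpad_gib(cores, per_hash_mib):
    """Memory needed if every core holds one full scratchpad, in GiB."""
    return cores * per_hash_mib / 1024

def n_for_scratchpad(bytes_per_hash):
    """N whose scratchpad is bytes_per_hash, using 128 * N bytes per hash."""
    return bytes_per_hash // 128

print(total_scratchpad_gib(1536, 512))   # 1536 cores at 512MB each, in GB
print(n_for_scratchpad(512 * MIB))       # N at the 512MB point
print(n_for_scratchpad(16 * MIB))        # N at the suggested 16MB cap
```

In practice a GPU does not need one hash per core in flight, so this is an upper bound on the "no TMTO" memory requirement rather than a prediction of real hardware limits.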

Killing hash rate performance rapidly is the goal.  Why would we want to cap the N increase before 512MB is reached?  It'll be reached in the vicinity of 10 years from now, and I suspect no one is going to be bothering with today's 6xxx series Radeon GPU's a decade from now.  In my opinion, the rate of N increasing is actually probably a bit low, and N started at too low a number, if the original developer's intent was to level the playing field between GPU's and CPU's.  Starting N at a level where 512MB is needed to calculate a hash actually would've been an interesting approach right from the start of the coin.

Huh, for some reason I thought N was going up far more rapidly than that. Forgive my bad memory. Anyway, starting N at 512MB memory usage would have resulted in people being able to mine at something like 20 H/s (not tested, only approximately calculated), but I guess it would have been plenty of speed. Verifying blocks would have taken quite a long time though and would have eaten some of your hashrate, but again, it'd probably have been fine.

Quote
The same TMTO that works for LTC, works for YAC.

For clarification, are we talking about the TMTO shortcut currently used by cgminer for scrypt+salsa20/8, in which a lookup gap allows you to access external RAM half as often by adding an extra salsa20/8 round to calculate the missing value 50% of the time?

Yes. It works for scrypt, is not dependent on the mixing function, and is even more general than that. You can use any integer lookup_gap, and the memory usage will be 128*N/lookup_gap bytes, and the mixing function will be called 1/2*(lookup_gap+3)*N times, per thread on average. I know this because I actually wrote the LTC kernel used by cgminer, and I also have a working YAC kernel for my own miner, Reaper.


I recognized your name, lol.

I actually still use Reaper and assumed it would be first to the gate with acceptable YAC hashes.

Are you still involved with RS and SC/ 'microcash' or have you moved on? For hire?


WindMaster, you are tilting at a windmill here imo. Do you honestly want a network based on relatively simple core density and cheap memory? Even IF you could keep it to CPU only (of course, you can't) you are opening yourself up to some ...unfortunate long-term propositions.

Anyways, I'm just watching from the sidelines, I'm not really interested in hashing anything except (1024,1,1) right now  Wink
sr. member
Activity: 347
Merit: 250
For clarification, are we talking about the TMTO shortcut currently used by cgminer for scrypt+salsa20/8, in which a lookup gap allows you to access external RAM half as often by adding an extra salsa20/8 round to calculate the missing value 50% of the time?

Yes. It works for scrypt, is not dependent on the mixing function, and is even more general than that. You can use any integer lookup_gap, and the memory usage will be 128*N/lookup_gap bytes, and the mixing function will be called 1/2*(lookup_gap+3)*N times, per thread on average.
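To make those two formulas concrete, here is a toy Python sketch of a scrypt-style ROMix loop with a lookup gap. It is not the Reaper or cgminer kernel: the mixing function is a SHA-256 stand-in rather than the salsa or chacha block mix, the blocks are 32 bytes rather than 128, and every name in it is an assumption made for illustration. It stores only every lookup_gap-th value, recomputes the missing ones on demand, and counts mixing calls so the result can be compared with 1/2*(lookup_gap+3)*N.

```python
import hashlib

def mix(block):
    """Stand-in mixing function (SHA-256), not the real salsa/chacha mix."""
    return hashlib.sha256(block).digest()

def romix(seed, n, lookup_gap=1):
    """Toy ROMix. Returns (result, mix_calls, stored_entries)."""
    calls = 0
    x = hashlib.sha256(b"init" + seed).digest()
    stored = []
    # Phase 1: walk the chain V[0..n-1], keeping every lookup_gap-th value.
    for i in range(n):
        if i % lookup_gap == 0:
            stored.append(x)
        x = mix(x)
        calls += 1
    # Phase 2: n data-dependent lookups into V.
    for _step in range(n):
        j = int.from_bytes(x[:4], "little") % n
        v = stored[j // lookup_gap]
        for _redo in range(j % lookup_gap):   # rebuild a skipped entry
            v = mix(v)
            calls += 1
        x = mix(bytes(a ^ b for a, b in zip(x, v)))
        calls += 1
    return x, calls, len(stored)

def expected_mix_calls(n, lookup_gap):
    """Average mixing calls per hash: 1/2 * (lookup_gap + 3) * N."""
    return (lookup_gap + 3) * n / 2

def scratchpad_bytes(n, lookup_gap):
    """Memory per hash for real scrypt with r=1: 128 * N / lookup_gap."""
    return 128 * n / lookup_gap

n = 1024
for gap in (1, 2, 4, 8):
    result, calls, entries = romix(b"yac", n, gap)
    print(gap, result.hex()[:16], calls, expected_mix_calls(n, gap),
          entries, scratchpad_bytes(n, gap))
```

The hash output is identical for every gap; only the stored entry count and the number of mixing calls change. With no gap the count is exactly 2*N, and for larger gaps the measured count should land close to the formula, since the extra recomputation averages (lookup_gap-1)/2 calls per lookup.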

Indeed.  You passed the test to check if you know what you're talking about.  Smiley


I know this because I actually wrote the LTC kernel used by cgminer, and I also have a working YAC kernel for my own miner, Reaper.

Oh yeah, I thought I recognized the name mtrlt somewhere.  For anyone in the thread not familiar with scrypt GPU mining history, mtrlt's Reaper was the first (released) OpenCL implementation of scrypt back when everyone claimed scrypt was GPU-resistant.  The kernel in cgminer is just a rip-off of the one in Reaper.

You would indeed be one of the people I'd expect to have modified your own OpenCL code for scrypt+chacha fairly early on.  Anyway, willing to post some hash rate info for your kernel at the current N=128 for a given GPU type and lookup gap (if any)?  You can probably post that info safely without giving anyone a head-start on making the modifications themselves.

Out of curiosity, how many hours after launch or after you started modifying your OpenCL kernel did it take you to make the changes?  Mainly a point of curiosity, for comparison with how many hours it took me.  I went from scratch rather than modifying the Reaper/cgminer kernel though, so my hour comparison will differ a bit because of that.