Author

Topic: [ANN][RIC] Riecoin: constellations POW *CPU* HARD FORK successful, world record - page 142. (Read 685207 times)

legendary
Activity: 1092
Merit: 1000
How many people are planning to buy Riecoin as soon (if) it goes to .00001. I know I am.

Its dead Sad

o.o lol. Its far from dead if you read Gatra's announcement. Your account makes me lose any faith in your credibility.

Gatra is one competent developer but the coin is dead. I will be following his work, should he launch a new coin i'll mine-support it.
sr. member
Activity: 308
Merit: 250
Riecoin and Huntercoin to rule all!
How many people are planning to buy Riecoin as soon (if) it goes to .00001. I know I am.

Its dead Sad

o.o lol. Its far from dead if you read Gatra's announcement. Your account makes me lose any faith in your credibility.
legendary
Activity: 1092
Merit: 1000
How many people are planning to buy Riecoin as soon (if) it goes to .00001. I know I am.

Its dead Sad
sr. member
Activity: 308
Merit: 250
Riecoin and Huntercoin to rule all!
How many people are planning to buy Riecoin as soon (if) it goes to .00001. I know I am.
sr. member
Activity: 308
Merit: 250
Riecoin and Huntercoin to rule all!
I know this is late but riecoin.org is updated to match updated client and miner version.
newbie
Activity: 16
Merit: 0
The first choice I made was to use 32bit everywhere (64bit arithmetic is expensive).
Yes of course. In the sieving code I use 32 bit too, except for the sieve setup, which requires long (256 bit) arithmetic.
Quote
This means each thread only searches at most (nonce -> nonce+2^32)
so I don't believe I should miss any, so I am not sure why a correction factor would be required.
I search from nonce to nonce+2^255 with a sieve of 2^31 bit size, and to do that in my sieve every bit represents a number 16057 + 173# * n, where 16057 is the offset of the second sixtuple and 173# is the primorial I use. This of course leaves large gaps between numbers represented by the sieve and I wrongly assumed your code would be similar.
Quote
Yes, memory bandwidth is a big issue in ARM as well.
It's not only bandwith (mainly reading a big precalculated table with primes and modulo inverses, which the x86 does indeed pretty well) but the latency of the pseudo random accesses when the primes get bigger than the table, each one amounting to 60 ns (300 clocks cycles!). Nevertheless i found it useful if in the end the blocks comes faster :-)
sr. member
Activity: 308
Merit: 250
Riecoin and Huntercoin to rule all!
Hi people!
I'm still short on time... other members of the community offered help to I'll be working with them.

My problem debugging (the code I hacked based on) dga's miner is that using Eclipse + gdb, after the mining threads are created it stops responding for a while, making debugging very hard.

Great! Destroying Ypool monopolization is a nice goal to have
hero member
Activity: 583
Merit: 505
CTO @ Flixxo, Riecoin dev
Hi people!
I'm still short on time... other members of the community offered help to I'll be working with them.

My problem debugging (the code I hacked based on) dga's miner is that using Eclipse + gdb, after the mining threads are created it stops responding for a while, making debugging very hard.
dga
hero member
Activity: 737
Merit: 511

I'll need to look at dga's code to come up with a method to map relative performance; question is does dga's miner look for any 4 out of 6 or does it look for any chain (contiguous) of 4?


The formula dga's miner uses is the 1st plus any three of the remaining 5.  So all other things being equal the scaling factor should be 10 (please check my numbers, red wine and all...).  You'll need to multiply the p4's I report by 10 and divide by the time taken.

Regards,

--
bsunau7

Correct.  I coordinated this definition with jh00 of ypool to ensure that the definition of a share was consistent, so if you need to change it for some reason, just be careful with the pool interaction whichever way you go.
member
Activity: 114
Merit: 10
Yes, I noticed the the "time to block" decreased to a third when using -m 5M. In my miner (for intel hardware though) I use all primes up to 2^32, so I was used to some larger numbers :-) Note that the time to block displayed my be serverly off. You need to apply a correction depending primorial you use, which will increase the likelyhood of finding a sextuplet, but OTOH you do miss some sextuplets if your primorial is bigger than 210 (and of course it is).

The first choice I made was to use 32bit everywhere (64bit arithmetic is expensive).  This means each thread only searches at most (nonce -> nonce+2^32)
sieving with larger primes has low chance of pruning the candidate.  It just becomes cheaper to stop sieving primes (diminishing returns) and start fermat testing.

I don't use the pnXX+97 step, but test all 5005 candidates per pn19, so I don't believe I should miss any, so I am not sure why a correction factor would be required.

Quote
Mertens 3rd theorem fits very well on this kind of sieves, it does give exactly the probability of not sieving a number if you sieve with all the primes up to n. You surely noticed the constant factor between p0/p1/p2... This factor can be calculated as the quotient from the 3rd mertens function with n as the sieve size and the prime density at the given difficulty, which is about 1/ln(2^(diff+256+8+1)). This factor raised to the sixt power (for p6) decreases significantly with the size of the primes used, so it might be worth using primes as big a possible, if your sieve is fast enough (but on an arm 2^32 is probably to big).

Yes I did, ~34-35 is the observed factor (@ current difficulty and sieving with the first 550k primes) and I use it outside of the miner to estimate time to block.  In the miner I still use the original gatra code for the time estimate, it is ~10% of what it should be which isn't a big concern (ie low priority).

I think the sieve code is pretty fast, but it is only fast because I stick to the ARM's word size.  A sieve of a few hundred thousand primes will be a natural limit for a 32bit implementation, until the difficulty increases (when fermat testing gets significantly more expensive) I don't see any benefit in removing the 32bit design limit.

Quote
The problem I found with the sieve after I eliminated all the modulo operations was that it is limited by memory bandwidth. To make more than one core do useful work on it, I needed to divide it into small chunks which fit into the caches.

Yes, memory bandwidth is a big issue in ARM as well.  Not only does it use slower memory, hardware pre-fetch is not as advanced as x86, reading & writing to the same cache line has a penalty as does misalignment.  My first few miners were all running at the same general speed no matter what I did, once I started minimizing memory access I was able to get some good gains.

Regards,

--
bsunau7
newbie
Activity: 16
Merit: 0
I suspect using a tenth will more than double overall performance (reducing it even more should improve it further).
Yes, I noticed the the "time to block" decreased to a third when using -m 5M. In my miner (for intel hardware though) I use all primes up to 2^32, so I was used to some larger numbers :-) Note that the time to block displayed my be serverly off. You need to apply a correction depending primorial you use, which will increase the likelyhood of finding a sextuplet, but OTOH you do miss some sextuplets if your primorial is bigger than 210 (and of course it is).
Mertens 3rd theorem fits very well on this kind of sieves, it does give exactly the probability of not sieving a number if you sieve with all the primes up to n. You surely noticed the constant factor between p0/p1/p2... This factor can be calculated as the quotient from the 3rd mertens function with n as the sieve size and the prime density at the given difficulty, which is about 1/ln(2^(diff+256+8+1)). This factor raised to the sixt power (for p6) decreases significantly with the size of the primes used, so it might be worth using primes as big a possible, if your sieve is fast enough (but on an arm 2^32 is probably to big).
I was thinking about a dynamic balancing the amount of sieving  based on the relative performance of the fermat tests
The problem I found with the sieve after I eliminated all the modulo operations was that it is limited by memory bandwith. To make more than one core do useful work on it, I needed to divide it into small chunks which fit into the caches.
sr. member
Activity: 308
Merit: 250
Riecoin and Huntercoin to rule all!
Any uppdate this week Gatra?
member
Activity: 114
Merit: 10
hmmmm ... I'll try over the weekend,
but 2 years - even improved by a factor 20  means more than a month for a block ... not very encouraging ...


My gut says his hardware should be about 4 time quicker with a better (smaller) sized sieve.  So only 5 times better to go (and there are 3 or 4 things which should get me significantly closer if I can pull it off).

Regards,

--
bsunau7
sr. member
Activity: 259
Merit: 250
hmmmm ... I'll try over the weekend,
but 2 years - even improved by a factor 20  means more than a month for a block ... not very encouraging ...
member
Activity: 114
Merit: 10
With -m 100000000 on a nexus 4:

Thanks!

With that -m option you are sieving the first 5.7million primes which is about 10 times larger than I would suggest.  I suspect using a tenth will more than double overall performance (reducing it even more should improve it further).  While your output was truncated the overall goal is to get the total time per thread to it's minimum.  Right now you'd be spending ~2 years for a block (i.e not worth it, but playing with -m should improve that significantly).

Some other notes, the sieve makes use of 128bit wide NEON instructions (most premium phones will have that).  All sieves are impacted by memory performance (and phone memory is slow), the nexus 4 has a smaller cache than my system so memory access will play a bigger role in determining your performance (which is why making the sieve even smaller might help).  Also I don't know enough about the Snapdragon processor, but most older ARM CPU's have very bad memory prefetch.

Right now the sieve time should be pretty constant but the fermat tests will vary based on difficulty, I was thinking about a dynamic balancing the amount of sieving  based on the relative performance of the fermat tests.  As difficulty drops less sieving, as it rises more sieving.  It should also help balance architectural differences in the CPUs in addition to difficulty levels.

Thanks again,

--
bsunau7
member
Activity: 114
Merit: 10
xpm aborts if the first number in the 6tuple is a composite (which is true most of the time), for the remaining 5 numbers the primes are counted. If I assume the search for a sixtuple in the arm miner is aborted when the first composite number is met (like in cpuminer-rminerd), you can multiply the 3ch and the 4ch numbers by about 10. This is because you have 10 choices to select 2 out of 5 (or 3 out of 5), 5!/(2!*3!) = 10.

Sorry didn't see your response.  Yes a linear test of p6 chain, as soon as I fail I abort.  For a solo miner it does not make sense to keep testing when you know a block can't be found, the ypool reward structure is different which is why different metrics are used.

--
bsunau7
member
Activity: 114
Merit: 10
I am looking for people with linux running on ARM processors to try a solo miner...

Does not run on Android Linux. Unresolved library dependencies. Can you build a static executable or an executable linked to bionic?

Edit: It's not only glibc, there are also a lot of undefined symbol from gmp. So a statically linked executable is needed, otherwise it will run only on your platform and nowhere else.

Tried a quick -static build and well libcurl dependencies need to be resolved.  I'll have some time in a few hours and will redouble my efforts!


A static version has been uploaded.  A minimal libcurl did the trick (was not looking forward to the Android NDK).

Regards,

--
bsunau7
member
Activity: 114
Merit: 10

I'll need to look at dga's code to come up with a method to map relative performance; question is does dga's miner look for any 4 out of 6 or does it look for any chain (contiguous) of 4?


The formula dga's miner uses is the 1st plus any three of the remaining 5.  So all other things being equal the scaling factor should be 10 (please check my numbers, red wine and all...).  You'll need to multiply the p4's I report by 10 and divide by the time taken.

Regards,

--
bsunau7
member
Activity: 114
Merit: 10
is this an open source miner ?

if yes i may try it out can you please point me to github or where the source is .. thanks


I might consider making n-1 type releases open, but it is not on my agenda right now.  From a development viewpoint I doubt it'll do much (anything?) to improve/speed up development cycles.

Regards,

--
bsunau7
Jump to: