[ANN]: cpuminer-opt v3.8.8.1, open source optimized multi-algo CPU miner - page 52.

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: warcries on November 30, 2017, 01:14:38 AM

Quote from: joblo on November 29, 2017, 08:32:25 AM

Quote from: warcries on November 29, 2017, 02:03:41 AM

@joblo

The program works fine on my Windows 10 machine, but when I run it on Windows Server 2012 it fails. The error is "program not responding".

More info please. It's probably your CPU.

I'm using an Intel Xeon X3430 (Lynnfield) on Windows Server 2012. The error is "program not responding".

thank you.

Did you read README.txt? There is a very interesting line:

Quote

Choose the exe that best matches you CPU's features or use trial and
error to find the fastest one that doesn't crash.

warcries

newbie

Activity: 4

Merit: 0

Quote from: joblo on November 29, 2017, 08:32:25 AM

Quote from: warcries on November 29, 2017, 02:03:41 AM

@joblo

The program works fine on my Windows 10 machine, but when I run it on Windows Server 2012 it fails. The error is "program not responding".

More info please. It's probably your CPU.

I'm using an Intel Xeon X3430 (Lynnfield) on Windows Server 2012. The error is "program not responding".

thank you.

joblo

legendary

Activity: 1470

Merit: 1114

I'm giving up on lyra2z 4way. It was never really about lyra2z anyway: the gain turned out to be only 2%,
with rejects.

The real point was to test blake256 as a step toward other algos. It's also used by cryptonight
and lyra2rev2.

With the whirlpool problem that's 2 failures in 2 days. It's a good thing I don't have a boss or
customers to answer to.

The problem with lyra2z is one of the weirdest I've ever encountered. I will probably revisit this
in the future when I am able to test the other algos that use blake256. For now I'll move forward
with other algos that build on the work done for tribus and nist5.

joblo

legendary

Activity: 1470

Merit: 1114

Here's a puzzle for coding experts.

I was testing with both sph and 4way running side by side and comparing the hash.
Everything was fine. Then I started cleaning up the code and the hash broke. What remains
is the last bit of code I can't remove without breaking the hash. I left a couple of commented-out
lines for context (no pun intended). The code as presented works. If I remove the line indicated,
the hash breaks and it only submits invalid shares that are rejected. It should be noted that
ctx_blake was never initialized, nor was sph_blake256 run before sph_blake256_close, so close is
running with random data. Both variables are local and are not referenced anywhere else.

I would suspect local stack corruption but in reverse. Instead of code corrupting the stack,
removing code does.

The input data is four 80-byte streams interleaved for blake_4way.
vhash is four 32-byte hash streams returned from blake_4way, interleaved.
hash0..3 are vhash deinterleaved so lyra2 can be run serially.
hash and ctx_blake are not in any way involved in the proper functioning of the code.

I'm stumped. Anyone have any insight?

Edit:
I tried nulling sph_blake256_close but it failed. It seems to be dependent on actually running the code
in the function.

I moved the funky code to the end of the function and everything still works. But still, if I remove it
the returned hash is invalid. SPH is stable code and not likely to be accessing data it shouldn't.
Even if it did, it would break something, not fix it. It's not even being used properly. There should be
no interactions between the sph code and the 4way code; they have their own data structures and supporting
functions and don't share anything.

I'm even more stumped.

Code:

void lyra2z_hash_4way( void *state, const void *input )
{
     uint32_t hash0[8] __attribute__ ((aligned (32)));
     uint32_t hash1[8] __attribute__ ((aligned (32)));
     uint32_t hash2[8] __attribute__ ((aligned (32)));
     uint32_t hash3[8] __attribute__ ((aligned (32)));
     uint32_t vhash[8*4] __attribute__ ((aligned (64)));
     blake256_4way_context ctx __attribute__ ((aligned (64)));

uint32_t _ALIGN(64) hash[8];
sph_blake256_context ctx_blake __attribute__ ((aligned (64)));
//memcpy( &ctx_blake, &lyra2z_blake_mid, sizeof lyra2z_blake_mid );
//sph_blake256( &ctx_blake, input + 64, 16 );
// removing the following line breaks the hash
sph_blake256_close( &ctx_blake, hash );

     memcpy( &ctx, &ctx_mid, sizeof ctx_mid );
     blake256_4way( &ctx, input + (64<<2), 16 );
     blake256_4way_close( &ctx, vhash );

     m128_deinterleave_4x32( hash0, hash1, hash2, hash3, vhash, 256 );

     LYRA2Z( lyra2z_wholeMatrix, hash0, 32, hash0, 32, hash0, 32, 8, 8, 8);
     LYRA2Z( lyra2z_wholeMatrix, hash1, 32, hash1, 32, hash1, 32, 8, 8, 8);
     LYRA2Z( lyra2z_wholeMatrix, hash2, 32, hash2, 32, hash2, 32, 8, 8, 8);
     LYRA2Z( lyra2z_wholeMatrix, hash3, 32, hash3, 32, hash3, 32, 8, 8, 8);

     memcpy( state   , hash0, 32 );
     memcpy( state+32, hash1, 32 );
     memcpy( state+64, hash2, 32 );
     memcpy( state+96, hash3, 32 );
}
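
For readers unfamiliar with the 4-way data layout described in the post, here is a minimal scalar sketch of interleaving and deinterleaving four streams of 32-bit words. It assumes the usual word-by-word layout (word i of lane k stored at index 4*i + k). The function names are hypothetical, and the real m128_deinterleave_4x32 in cpuminer-opt takes a bit length (256 above) and may differ in detail.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: interleave four streams of 32-bit words.
   Word i of lane k is stored at dst[4*i + k], so each group of four
   consecutive words (one 128-bit vector) holds word i of every lane. */
static void interleave_4x32(uint32_t *dst,
                            const uint32_t *s0, const uint32_t *s1,
                            const uint32_t *s2, const uint32_t *s3,
                            size_t words)
{
    for (size_t i = 0; i < words; i++) {
        dst[4 * i + 0] = s0[i];
        dst[4 * i + 1] = s1[i];
        dst[4 * i + 2] = s2[i];
        dst[4 * i + 3] = s3[i];
    }
}

/* Hypothetical helper: the inverse operation, splitting an interleaved
   buffer back into four separate per-lane streams. */
static void deinterleave_4x32(uint32_t *d0, uint32_t *d1,
                              uint32_t *d2, uint32_t *d3,
                              const uint32_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        d0[i] = src[4 * i + 0];
        d1[i] = src[4 * i + 1];
        d2[i] = src[4 * i + 2];
        d3[i] = src[4 * i + 3];
    }
}
```

Under this assumed layout, byte offset 64 within each lane corresponds to byte offset 64*4 in the interleaved buffer, which would explain the `input + (64<<2)` in the posted code.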

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: fynxgloire on November 29, 2017, 10:52:02 AM

Hi,
What is the best bang for the buck Xeon processor to go with the H110 Pro BTC+ motherboard?
or
Can an Intel Core i7-8700K CPU work with this motherboard?

regards

System building recommendations deserve their own thread, I'd rather keep this one about
cpuminer software.

That being said, if you're trying to build a combo GPU/CPU rig it's entirely feasible. CPU choice
depends on the features of the various CPU architectures. There are several threads already discussing
the benefits of different architectures and features for CPU mining.

fynxgloire

full member

Activity: 294

Merit: 100

Hi,
What is the best bang for the buck Xeon processor to go with the H110 Pro BTC+ motherboard?
or
Can an Intel Core i7-8700K CPU work with this motherboard?

regards

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: warcries on November 29, 2017, 02:03:41 AM

@joblo

The program works fine on my Windows 10 machine, but when I run it on Windows Server 2012 it fails. The error is "program not responding".

More info please. It's probably your CPU.

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: nizzuu on November 29, 2017, 01:54:39 AM

Seems one of the questions was lost in the thread(

Sample usage:

cpuminer-aes-avx2 -a lyra2z330 -t 2 --benchmark

The first hashrate output is shown after ~7-8 minutes on an i5-7600 (860+ h/s), and ~15 minutes on a slower (450+ h/s) Pentium, but the appropriate CPU utilization starts immediately.

I tried the new 4way nist5 and tribus: speed is shown immediately, as it is on lyra2z. Why is the first output so slow? It's a real pain to benchmark...

Interesting observation. There is nothing unique about how cpuminer handles lyra2z330 vs other algos. Lyra2z330 is, however,
unique as the slowest hashing algo. It also has to do a little more work on startup (malloc) that other algos don't do. But that
doesn't take minutes.

It might be worthwhile to pay attention to the hash count. Is it proportional to the time? I'm not sure what else to suggest.

BTW lyra2z330 will not benefit from 4way. It is pure lyra2, which is already using AVX2 horizontally. Vertical (4way) AVX2 would
not likely affect compute performance. Furthermore, lyra2z is I/O bound (memory hard), so improving compute performance just
means the CPU would spend more time stalled waiting on data from memory.

warcries

newbie

Activity: 4

Merit: 0

@joblo

The program works fine on my Windows 10 machine, but when I run it on Windows Server 2012 it fails. The error is "program not responding".

nizzuu

full member

Activity: 187

Merit: 100

Cryptocurrency enthusiast

Seems one of the questions was lost in the thread(

Sample usage:

cpuminer-aes-avx2 -a lyra2z330 -t 2 --benchmark

The first hashrate output is shown after ~7-8 minutes on an i5-7600 (860+ h/s), and ~15 minutes on a slower (450+ h/s) Pentium, but the appropriate CPU utilization starts immediately.

I tried the new 4way nist5 and tribus: speed is shown immediately, as it is on lyra2z. Why is the first output so slow? It's a real pain to benchmark...

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: spider703 on November 28, 2017, 05:58:30 PM

cpuminer-4way not working on my i7-3770

From README.txt:

Quote

4way requires a CPU with AES and AVX2.

Your CPU is Ivy Bridge, no AVX2.
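
For anyone who wants to check this programmatically rather than by looking up the CPU model, here is a minimal sketch using the GCC/Clang built-in `__builtin_cpu_supports`. The helper name `cpu_supports_4way` is hypothetical and is not part of cpuminer-opt. On non-x86 builds it simply reports that the features are unavailable.

```c
/* Hypothetical helper: returns 1 if the compiler's runtime CPU detection
   reports both AES and AVX2 (the README's requirement for 4way), else 0. */
static int cpu_supports_4way(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();   /* harmless if detection already ran */
    return __builtin_cpu_supports("aes") && __builtin_cpu_supports("avx2");
#else
    return 0;               /* not an x86 GCC/Clang build */
#endif
}
```

On an i7-3770 (Ivy Bridge), which has AES but not AVX2, this would return 0.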

spider703

full member

Activity: 1890

Merit: 148

cpuminer-4way not working on my i7-3770

joblo

legendary

Activity: 1470

Merit: 1114

cpuminer-opt-3.7.4 is released.

Added 4 way support for tribus and nist5.

Removed some unnecessary compile options.

A 4-way Windows binary is now available.

I'm waiting for someone to get the bonus. The bonus is if one thread can find more than one nonce in parallel.
It's very rare and I haven't seen it yet, but the code checks for it to make sure second, third or even fourth
nonces are submitted. It's almost like a lotto but you don't win anything. The multiple nonces are all part
of the odds.
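
The multiple-nonce check described above could look roughly like the following sketch. It is not the cpuminer-opt code: the function name, the flat hash layout, and the simplified test (comparing only the most significant 32-bit word of each hash against the target's) are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical sketch: hashes holds four deinterleaved 8-word hashes
   back to back (lane k starts at hashes[8*k]), computed for nonces
   first_nonce .. first_nonce + 3. Every lane whose top word is at or
   below target_hi is recorded, so 0 to 4 nonces can be returned.
   A real miner would follow up with a full 256-bit comparison. */
static int collect_lane_nonces(const uint32_t *hashes, uint32_t target_hi,
                               uint32_t first_nonce, uint32_t found[4])
{
    int count = 0;
    for (int lane = 0; lane < 4; lane++) {
        if (hashes[8 * lane + 7] <= target_hi)
            found[count++] = first_nonce + (uint32_t)lane;
    }
    return count;
}
```

The caller would then submit each of the `count` entries in `found` rather than stopping at the first hit.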

git: https://github.com/JayDDee/cpuminer-opt

tarball: https://drive.google.com/file/d/1AwdqMWFufxZmuKWKHkWCjlfRm0SPqID8/view?usp=sharing

Windows binaries: https://drive.google.com/file/d/1opN5Wb5tL9_wes8RsZ6QSftOOo2Uhb6p/view?usp=sharing

joblo

legendary

Activity: 1470

Merit: 1114

I've encountered my first major roadblock with 4-way with whirlpool.

The core of whirlpool is a table lookup, but the table index is a variable, meaning each lane in the vector
uses a different index, i.e. each lane reads a different address. This operation is not efficient with SIMD as
it needs to load one 64 bit element from 4 different addresses. Although there is a SIMD instruction to do this
it is very expensive with an optimum throughput of 4 to read 4 items. That's no faster than performing
the operation with scalar instructions. When the 4-way overhead is added it hashes significantly slower
than the old way.

I suspect GPUs don't have this problem because each lane has its own dedicated core with its own local memory.
All memory accesses can run in parallel with different addresses. On a CPU 4 lanes run on the same core
accessing data from 4 addresses from the same memory system, serially.

This looks like an architectural issue that can't be overcome.

This will affect algos like x15, xevan & m7m which will gain less than previously anticipated.
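
To make the access pattern concrete, here is a scalar model of the per-lane lookup described above. It assumes a 256-entry table of 64-bit words indexed by one byte per lane, as is typical of table-driven whirlpool implementations. The function name is hypothetical. A SIMD gather instruction has the same semantics as this loop: four independent loads from four unrelated addresses.

```c
#include <stdint.h>

/* Hypothetical scalar model of a 4-lane table lookup. Each lane carries
   its own index, so the loads cannot be merged into a single contiguous
   vector load the way same-address or sequential accesses can. */
static void lookup_4way(uint64_t out[4], const uint64_t *table,
                        const uint8_t idx[4])
{
    for (int lane = 0; lane < 4; lane++)
        out[lane] = table[idx[lane]];
}
```

Because the vector form still has to fetch four separate elements, it offers little over running this loop with scalar instructions, which matches the result reported in the post.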

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: nizzuu on November 27, 2017, 03:08:24 AM

Quote from: joblo on November 26, 2017, 05:12:27 PM

Do I go with Ryzen and just SHA or wait for Cannonlake with SHA and AVX-512?

Well, Ryzens have 2x128-bit wide AVX units instead of 256, don't forget about it ;)

I don't think this is a good implementation to target.

As for AVX-512, the only adequate choice for now is the i7-7800X; it's not so expensive but has a 140W TDP :P

(and liquit ship inside instead of solder).

Yes, Ryzen's implementation of AVX2 is inferior. But AVX2 and AVX512 don't improve a CPU's competitive disadvantage
to GPUs. SHA does, and is available now with Ryzen. If Cannonlake came out in summer I could wait for it, but as the
release gets delayed a Ryzen purchase becomes more likely.

nizzuu

full member

Activity: 187

Merit: 100

Cryptocurrency enthusiast

Quote from: joblo on November 26, 2017, 05:12:27 PM

Do I go with Ryzen and just SHA or wait for Cannonlake with SHA and AVX-512?

Well, Ryzens have 2x128-bit wide AVX units instead of 256, don't forget about it ;)

I don't think this is a good implementation to target.

As for AVX-512, the only adequate choice for now is the i7-7800X; it's not so expensive but has a 140W TDP :P

(and liquit ship inside instead of solder).

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: felixbrucker on November 26, 2017, 06:28:28 PM

why not both :P

Timing. Cannonlake is delayed until the end of 2018 now, with more delays still possible. A Ryzen purchase
could be done in spring when the next Ubuntu LTS is released.

Both does have some advantages. I could get the Ryzen earlier before next LTS and then Cannonlake
delays don't matter.

felixbrucker

hero member

Activity: 700

Merit: 500

why not both :P

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: nizzuu on November 26, 2017, 02:47:14 PM

Hi, this may be useful as well: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#othertechs=BMI2&expand=3773

The AVX-512F section as well, but I have no CPU that supports it :-( So I can't test any benefits compared to AVX2. There should be some, but...

LOL. That page is permanently open in my browser.

I was wondering when someone would mention AVX-512.
I'm already dreaming about 8-way. It should be easier than going from 1-way to 4-way. That makes my next
CPU a difficult choice. Do I go with Ryzen and just SHA or wait for Cannonlake with SHA and AVX-512?

nizzuu

full member

Activity: 187

Merit: 100

Cryptocurrency enthusiast

Quote from: joblo on November 25, 2017, 04:07:38 PM

Large pages have already been done for cryptonight. I'm doing something that hasn't been done yet.
Large pages for cpuminer-opt will have to wait, though they could benefit a couple of memory hard algos.

Hi, this may be useful as well: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#othertechs=BMI2&expand=3773

The AVX-512F section as well, but I have no CPU that supports it :-( So I can't test any benefits compared to AVX2. There should be some, but...

Topic: [ANN]: cpuminer-opt v3.8.8.1, open source optimized multi-algo CPU miner - page 52. (Read 444122 times)