Author

Topic: [ANN]: cpuminer-opt v3.8.8.1, open source optimized multi-algo CPU miner - page 173. (Read 444067 times)

legendary
Activity: 1470
Merit: 1114
A number of users have reported problems with AMD CPUs that don't have AES_NI but do
have SSE2.

Problems include compile failures and error exits on startup due to a perceived lack of support
for SSE2.

Solving this problem would go quicker with better info from users when reporting problems.
This applies to any problem, not just the current AMD issues, so here are some tips for
problem reporting.

Give some info about your environment, CPU, OS, etc.

In addition to a description of the problem, show the problem. Post the console session where the problem occurred,
showing the command entered and the output produced.

Is it a new problem, i.e. did it work before?

Have you deviated from the recommended or previously used procedure or is there any change in the environment?

Have you tried to solve or workaround the problem yourself? How?

Provide info specific to the problem. In this case run the following command and post the output:

gcc -march=native -Q --help=target | fgrep march

legendary
Activity: 1470
Merit: 1114
I tried deleting the source code line; then "SW built for SSE2..........NO." changed to "YES", but the miner stops working immediately. I also tried several -march versions.

More info please. What exactly did you change? I didn't suggest changing NO to YES. What does "stopped working" mean?
Did it compile? Did it crash? Did it exit cleanly?
full member
Activity: 192
Merit: 100
I tried deleting the source code line; then "SW built for SSE2..........NO." changed to "YES", but the miner stops working immediately. I also tried several -march versions.

same here with: Sempron145 CPU, configure and make with no mistakes

Code:
         **********  cpuminer-opt 3.1.16  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI extension.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0 and Jeff Garzik.

Checking CPU capatibility...
        AMD Phenom(tm) II X4 940 Processor
   CPU arch supports AES_NI...NO.
   CPU arch supports SSE2.....YES.
   SW built for SSE2..........NO.
Incompatible SW build, rebuild with "-march=native"

Why?
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Going after the algos would be daunting as each code segment would have to be analyzed individually. Modifying the scanning
engine to process two, or more, nonces in parallel might give bigger gains at lower effort.

How does ccminer do it in cuda?

it's pretty basic.
there is a for cycle with step = number of threads.
just divide the number of threads by the nonces per thread when running the kernel, and make the single thread process more nonces.
you can even do it all in the algo specific file (I did it for decred), without touching the main code.
legendary
Activity: 1708
Merit: 1049
Useful link for replacing slow & obsolete implementations: http://bench.cr.yp.to/primitives-hash.html

Perhaps if one googles algo by algo, they can find even better (?).
legendary
Activity: 1470
Merit: 1114
Quote from: joblo
What this message is supposed to mean is although the CPU supports SSE2 it wasn't compiled in.
This should only occur if you specify an arch that doesn't support SSE2. Which arch did you use?

Same as always:
Code:
./autogen.sh && ./configure CFLAGS="-O3 -march=btver1" --with-curl --with-crypto && make
I am 100% sure that btver1 includes SSE2

Quote from: joblo
It looks like AMD is going to be a challenge. As AMD users I'll leave it up to you guys to figure out the workarounds
for -march for various CPUs. I can then add notes to the README

The same command line worked for AMD since cpuminer-opt-3.1.9 and now doesn't on cpuminer-opt-3.1.16.
The last version it worked on was cpuminer-opt-3.1.15, so something changed between them.

Also, the compile output is really brief now, even on an Intel CPU in cpuminer-opt-3.1.16.

I suspect the SSE2 SW check isn't working. Did you try to override it? If it works with the override I'll remove the check
permanently.

Edit: The override should work because the compile succeeded. If the compiler was truly compiling a non-SSE2 arch
it would have failed on the SSE2 instructions. It would seem the __SSE2__ compiler macro is unreliable. I may remove
the check completely or make it non-fatal.
sr. member
Activity: 312
Merit: 250
Quote from: joblo
What this message is supposed to mean is although the CPU supports SSE2 it wasn't compiled in.
This should only occur if you specify an arch that doesn't support SSE2. Which arch did you use?

Same as always:
Code:
./autogen.sh && ./configure CFLAGS="-O3 -march=btver1" --with-curl --with-crypto && make
I am 100% sure that btver1 includes SSE2

Quote from: joblo
It looks like AMD is going to be a challenge. As AMD users I'll leave it up to you guys to figure out the workarounds
for -march for various CPUs. I can then add notes to the README

The same command line worked for AMD since cpuminer-opt-3.1.9 and now doesn't on cpuminer-opt-3.1.16.
The last version it worked on was cpuminer-opt-3.1.15, so something changed between them.

Also, the compile output is really brief now, even on an Intel CPU in cpuminer-opt-3.1.16.
legendary
Activity: 1708
Merit: 1049
No idea, haven't looked into CUDA mining.
legendary
Activity: 1470
Merit: 1114
May I propose a different approach for much faster mining?

Currently, most, if not all of CPU-mineable coins, are cripple-mined.

The reason is simple: under-utilization of the SIMD nature of the SSE & AVX instruction sets.

SSE and AVX commands are used in SISD fashion (single instruction single data, instead of Multiple data / SIMD), meaning they are not processing 2 batches of information but one.

Right now hashing goes on like that:

The main mining routine sends one input to each hash, where it is subject to a process of SERIAL transmutations / permutations, and in the end the hash outputs that data back to the miner (sometimes to be sent on to the next hash).

This serial process doesn't allow for much Single Instruction Multiple Data utilization.

What should be done instead is that the miner program should issue 2-4 hash candidates to the hashing routines. The hashing routines should be able to get 2-4 inputs (instead of 1) and return back 2-4 outputs. In this way the process would be paralleled and SIMD utilization (packed processing of similar instructions) would result in much faster processing.

Now this might require a lot of recoding, or, one could adjust the code in C for use with a special compiler which runs multiple instances of serial data crunching in order to process them in "packs" with SIMD or "packed" instructions - and then let the compiler do all the packing. Performance benefits of such an approach here: http://ispc.github.io/perf.html

That's a fascinating idea but I don't think it will get the visibility here that it deserves. Pooler and TPruvot are the two main guys for
CPU mining although TPruvot is focussed more on other projects at the moment. Both have active threads in this forum. I suggest you
present your idea to them in case they, or their followers, may want to take on the challenge. It's beyond my skill level.

It's ok, don't worry. Some people reading this thread will know what to do with it.

I'm not really into altcoin mining as I don't have the hardware and I'm not in the mood for renting. Obviously there's a lot of money here for optimized miners achieving multiples of the hashrate of ordinary ones. But this idea also extends to the scaling of bitcoin and altcoins for things like cryptographic verification etc. They use serial functionality when it could be done in packs of 2 or 4 (or 8 in something like ...AVX3-4-5 - or AVX512, which already exists).

Going after the algos would be daunting as each code segment would have to be analyzed individually. Modifying the scanning
engine to process two, or more, nonces in parallel might give bigger gains at lower effort.

How does ccminer do it in cuda?
legendary
Activity: 1470
Merit: 1114
Code:
         **********  cpuminer-opt 3.1.16  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI extension.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0 and Jeff Garzik.

Checking CPU capatibility...
        AMD Phenom(tm) II X4 940 Processor
   CPU arch supports AES_NI...NO.
   CPU arch supports SSE2.....YES.
   SW built for SSE2..........NO.
Incompatible SW build, rebuild with "-march=native"

Why?

What this message is supposed to mean is although the CPU supports SSE2 it wasn't compiled in.
This should only occur if you specify an arch that doesn't support SSE2. Which arch did you use?

You can override the error by commenting out the exit statement below and recompiling. Note that
cpuminer may crash with the override if the message was correct.

cpu-miner.c function check_cpu_capability line#2700
Code:
         // make sure CPU has at least SSE2
         printf("   CPU arch supports SSE2.....");
         if ( cpu_has_sse2 )
         {
            printf("%s\n", grn_yes );
            printf("   SW built for SSE2..........");
            if ( sw_has_sse2 && !sw_has_aes )
            {
                printf("%s\n", grn_yes );
                printf_mine_without_aes();
            }
            else
            {
                printf("%s\n", ylw_no );
                printf_bad_build();
                exit(1);                            <-------- delete or comment this line

It looks like AMD is going to be a challenge. As AMD users I'll leave it up to you guys to figure out the workarounds
for -march for various CPUs. I can then add notes to the README
legendary
Activity: 1708
Merit: 1049
It's not a new idea. It was used back in the GPU bitcoin mining days to get better speed on amd VLIW cards.
It's easy to adapt the miner itself to process multiple nonces per thread, not sure about how much work is needed to work on the algos themselves. Maybe we could make a test with a simple algo like blake. But I'm not the man because I'm not proficient in those cpu instruction extensions.

Neither am I, but it's not that difficult.

Say for example you have a loop like:


for (i = 0; i < 100000000; i++) {
   b    = sqrt(b);
   bb   = sqrt(bb);
   bbb  = sqrt(bbb);
   bbbb = sqrt(bbbb);
}

...gcc will make it something like:

40072e:   0f 84 9b 00 00 00       je     4007cf
  400734:   f2 0f 51 d6             sqrtsd %xmm6,%xmm2
  400738:   66 0f 2e d2             ucomisd %xmm2,%xmm2
  40073c:   0f 8a 63 02 00 00       jp     4009a5
  400742:   66 0f 28 f2             movapd %xmm2,%xmm6
  400746:   f2 0f 51 cd             sqrtsd %xmm5,%xmm1
  40074a:   66 0f 2e c9             ucomisd %xmm1,%xmm1
  40074e:   0f 8a d9 01 00 00       jp     40092d
  400754:   66 0f 28 e9             movapd %xmm1,%xmm5
  400758:   f2 0f 51 c7             sqrtsd %xmm7,%xmm0
  40075c:   66 0f 2e c0             ucomisd %xmm0,%xmm0
  400760:   0f 8a 47 01 00 00       jp     4008ad
  400766:   66 0f 28 f8             movapd %xmm0,%xmm7
  40076a:   f2 0f 51 c3             sqrtsd %xmm3,%xmm0
  40076e:   66 0f 2e c0             ucomisd %xmm0,%xmm0
  400772:   0f 8a b5 00 00 00       jp     40082d

...which is sqrt-scalar-double.

4 instructions / 4 math operations.

What could be done differently (intel syntax follows):

     movlpd xmm1, b      //loading the first variable "b" to the lower part of xmm1
     movhpd xmm1, bb     //loading the second variable "bb" to the higher part of xmm1
     SQRTPD xmm1, xmm1   //batch processing both variables for their square root, with one SIMD command
     movlpd xmm2, bbb    //loading the third variable "bbb" to the lower part of xmm2
     movhpd xmm2, bbbb   //loading the fourth variable "bbbb" to the higher part of xmm2
     SQRTPD xmm2, xmm2   //batch processing their square roots
     movlpd b, xmm1      //
     movhpd bb, xmm1     // Returning all results from the registers back to memory
     movlpd bbb, xmm2    //
     movhpd bbbb, xmm2   //

SQRTPD - Square root - P(acked)-Double.

So now 4 math instructions became 2 and the time went down by nearly half (I've actually benchmarked the above and it comes close to half). But in order to pack instructions (math or logical) you need to have similar processing load, similar operations. You can't have that in a scenario where it goes like

sqrt
add
shift
xor

and the function is changing...

But if you loaded 4x hashes together, you'd be looking at

sqrt(of the first) sqrt (of the second) sqrt (third) sqrt (fourth) (<=pack them)
add add add add (<=pack them)
shift shift shift shift (<=pack them)
xor xor xor xor (
...etc

I wasn't even aware of the above until a couple of weeks ago, when I got down to the asm level to see what happens and why some Pascal output was slower than C output... Then I ran into http://x86.renejeschke.de as a reference while trying to understand the instructions and what they do, and rewrote some instructions myself - like the packed version above (I thought it was pretty easy really). Then, more recently, I went over the code of the asm hash functions of altcoins and bitcoin - and it was full of serial operations, despite claims of "SSE/AVX use" / "SSE/AVX enhanced". And I'm like WHAT THE F***? This is all crippled.
legendary
Activity: 1708
Merit: 1049
May I propose a different approach for much faster mining?

Currently, most, if not all of CPU-mineable coins, are cripple-mined.

The reason is simple: under-utilization of the SIMD nature of the SSE & AVX instruction sets.

SSE and AVX commands are used in SISD fashion (single instruction single data, instead of Multiple data / SIMD), meaning they are not processing 2 batches of information but one.

Right now hashing goes on like that:

The main mining routine sends one input to each hash, where it is subject to a process of SERIAL transmutations / permutations, and in the end the hash outputs that data back to the miner (sometimes to be sent on to the next hash).

This serial process doesn't allow for much Single Instruction Multiple Data utilization.

What should be done instead is that the miner program should issue 2-4 hash candidates to the hashing routines. The hashing routines should be able to get 2-4 inputs (instead of 1) and return back 2-4 outputs. In this way the process would be paralleled and SIMD utilization (packed processing of similar instructions) would result in much faster processing.

Now this might require a lot of recoding, or, one could adjust the code in C for use with a special compiler which runs multiple instances of serial data crunching in order to process them in "packs" with SIMD or "packed" instructions - and then let the compiler do all the packing. Performance benefits of such an approach here: http://ispc.github.io/perf.html

That's a fascinating idea but I don't think it will get the visibility here that it deserves. Pooler and TPruvot are the two main guys for
CPU mining although TPruvot is focussed more on other projects at the moment. Both have active threads in this forum. I suggest you
present your idea to them in case they, or their followers, may want to take on the challenge. It's beyond my skill level.

It's ok, don't worry. Some people reading this thread will know what to do with it.

I'm not really into altcoin mining as I don't have the hardware and I'm not in the mood for renting. Obviously there's a lot of money here for optimized miners achieving multiples of the hashrate of ordinary ones. But this idea also extends to the scaling of bitcoin and altcoins for things like cryptographic verification etc. They use serial functionality when it could be done in packs of 2 or 4 (or 8 in something like ...AVX3-4-5 - or AVX512, which already exists).
full member
Activity: 192
Merit: 100
same here with: Sempron145 CPU, configure and make with no mistakes

Code:
         **********  cpuminer-opt 3.1.16  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI extension.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0 and Jeff Garzik.

Checking CPU capatibility...
        AMD Phenom(tm) II X4 940 Processor
   CPU arch supports AES_NI...NO.
   CPU arch supports SSE2.....YES.
   SW built for SSE2..........NO.
Incompatible SW build, rebuild with "-march=native"

Why?
sr. member
Activity: 312
Merit: 250
Code:
         **********  cpuminer-opt 3.1.16  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI extension.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0 and Jeff Garzik.

Checking CPU capatibility...
        AMD Phenom(tm) II X4 940 Processor
   CPU arch supports AES_NI...NO.
   CPU arch supports SSE2.....YES.
   SW built for SSE2..........NO.
Incompatible SW build, rebuild with "-march=native"

Why?
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
It's not a new idea. It was used back in the GPU bitcoin mining days to get better speed on amd VLIW cards.
It's easy to adapt the miner itself to process multiple nonces per thread, not sure about how much work is needed to work on the algos themselves. Maybe we could make a test with a simple algo like blake. But I'm not the man because I'm not proficient in those cpu instruction extensions.
sr. member
Activity: 312
Merit: 250
Can you run this command on your AMD processors and show me the output?
Code:
gcc -march=native -Q --help=target | fgrep march

Here you are:
Code:
root@beast:~$ gcc -march=native -Q --help=target | fgrep march
  -march=                               amdfam10

Then your build should have worked out of the box with -march=native. amdfam10 doesn't have AES support, so gcc won't define the __AES__ macro. Can you try building it with -march=native again?

This is curious. I presume that shows which arch is used by native.

On my skylake I get core2-avx and on my haswell I get corei7-avx.
configure fails with -march=skylake on my skylake.

Yes, this shows which arch gcc uses for -march=native.

In your case skylake is too new for gcc; your gcc 4.8.4 doesn't know about it, so it should choose the closest match with the most features enabled. There's no 'core2-avx' in the GCC 4.8.4 manual - maybe you meant 'core-avx2'? core-avx2 defines __AES__ automatically.

https://gcc.gnu.org/onlinedocs/gcc-4.8.4/gcc/i386-and-x86-64-Options.html

Nope...
./build.sh still fails on AMD.
Should I try something else?
legendary
Activity: 1470
Merit: 1114
May I propose a different approach for much faster mining?

Currently, most, if not all of CPU-mineable coins, are cripple-mined.

The reason is simple: under-utilization of the SIMD nature of the SSE & AVX instruction sets.

SSE and AVX commands are used in SISD fashion (single instruction single data, instead of Multiple data / SIMD), meaning they are not processing 2 batches of information but one.

Right now hashing goes on like that:

The main mining routine sends one input to each hash, where it is subject to a process of SERIAL transmutations / permutations, and in the end the hash outputs that data back to the miner (sometimes to be sent on to the next hash).

This serial process doesn't allow for much Single Instruction Multiple Data utilization.

What should be done instead is that the miner program should issue 2-4 hash candidates to the hashing routines. The hashing routines should be able to get 2-4 inputs (instead of 1) and return back 2-4 outputs. In this way the process would be paralleled and SIMD utilization (packed processing of similar instructions) would result in much faster processing.

Now this might require a lot of recoding, or, one could adjust the code in C for use with a special compiler which runs multiple instances of serial data crunching in order to process them in "packs" with SIMD or "packed" instructions - and then let the compiler do all the packing. Performance benefits of such an approach here: http://ispc.github.io/perf.html

That's a fascinating idea but I don't think it will get the visibility here that it deserves. Pooler and TPruvot are the two main guys for
CPU mining although TPruvot is focussed more on other projects at the moment. Both have active threads in this forum. I suggest you
present your idea to them in case they, or their followers, may want to take on the challenge. It's beyond my skill level.

legendary
Activity: 1708
Merit: 1049
May I propose a different approach for much faster mining?

Currently, most, if not all of CPU-mineable coins, are cripple-mined.

The reason is simple: under-utilization of the SIMD nature of the SSE & AVX instruction sets.

SSE and AVX commands are used in SISD fashion (single instruction single data, instead of Multiple data / SIMD), meaning they are not processing 2 batches of information but one.

Right now hashing goes on like that:

The main mining routine sends one input to each hash, where it is subject to a process of SERIAL transmutations / permutations, and in the end the hash outputs that data back to the miner (sometimes to be sent on to the next hash).

This serial process doesn't allow for much Single Instruction Multiple Data utilization.

What should be done instead is that the miner program should issue 2-4 hash candidates to the hashing routines. The hashing routines should be able to get 2-4 inputs (instead of 1) and return back 2-4 outputs. In this way the process would be paralleled and SIMD utilization (packed processing of similar instructions) would result in much faster processing.

Now this might require a lot of recoding, or, one could adjust the code in C for use with a special compiler which runs multiple instances of serial data crunching in order to process them in "packs" with SIMD or "packed" instructions - and then let the compiler do all the packing. Performance benefits of such an approach here: http://ispc.github.io/perf.html
legendary
Activity: 1470
Merit: 1114
Can you run this command on your AMD processors and show me the output?
Code:
gcc -march=native -Q --help=target | fgrep march

Here you are:
Code:
root@beast:~$ gcc -march=native -Q --help=target | fgrep march
  -march=                               amdfam10

Then your build should have worked out of the box with -march=native. amdfam10 doesn't have AES support, so gcc won't define the __AES__ macro. Can you try building it with -march=native again?

This is curious. I presume that shows which arch is used by native.

On my skylake I get core2-avx (corrected: core-avx2) and on my haswell (corrected: sandy bridge) I get corei7-avx.
configure fails with -march=skylake on my skylake.

Yes, this shows which arch gcc uses for -march=native.

In your case skylake is too new for gcc; your gcc 4.8.4 doesn't know about it, so it should choose the closest match with the most features enabled. There's no 'core2-avx' in the GCC 4.8.4 manual - maybe you meant 'core-avx2'? core-avx2 defines __AES__ automatically.

https://gcc.gnu.org/onlinedocs/gcc-4.8.4/gcc/i386-and-x86-64-Options.html

Hmage, you're a good teacher and you know your stuff, I'm learning a lot. Core-avx2 correction noted.
member
Activity: 83
Merit: 10
Can you run this command on your AMD processors and show me the output?
Code:
gcc -march=native -Q --help=target | fgrep march

Here you are:
Code:
root@beast:~$ gcc -march=native -Q --help=target | fgrep march
  -march=                               amdfam10

Then your build should have worked out of the box with -march=native. amdfam10 doesn't have AES support, so gcc won't define the __AES__ macro. Can you try building it with -march=native again?

This is curious. I presume that shows which arch is used by native.

On my skylake I get core2-avx and on my haswell I get corei7-avx.
configure fails with -march=skylake on my skylake.

Yes, this shows which arch gcc uses for -march=native.

In your case skylake is too new for gcc; your gcc 4.8.4 doesn't know about it, so it should choose the closest match with the most features enabled. There's no 'core2-avx' in the GCC 4.8.4 manual - maybe you meant 'core-avx2'? core-avx2 defines __AES__ automatically.

https://gcc.gnu.org/onlinedocs/gcc-4.8.4/gcc/i386-and-x86-64-Options.html