[ANN]: cpuminer-opt v3.8.8.1, open source optimized multi-algo CPU miner - page 173.

joblo

legendary

Activity: 1470

Merit: 1114

A number of users have reported problems with AMD CPUs that don't have AES_NI but do
have SSE2.

Problems include compile failures and error exits on startup due to a perceived lack of support
for SSE2.

Solving this problem would go quicker with better info from users when they report problems.
This applies to any problem, not just the current AMD issues, so here are some tips for
problem reporting.

Give some info about your environment, CPU, OS, etc.

In addition to describing the problem, show it. Post the console session where the problem occurred,
showing the command entered and the output produced.

Is it a new problem, i.e. did it work before?

Have you deviated from the recommended or previously used procedure, or has anything changed in the environment?

Have you tried to solve or work around the problem yourself? How?

Provide info specific to the problem. In this case run the following command and post the output:

gcc -march=native -Q --help=target | fgrep march

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: Giulini on April 22, 2016, 07:43:33 AM

Tried it out to delete the sourcecode line, then "SW built for SSE2..........NO." change to "YES", but the miner stops working immediately; tried also several -march versions

More info please. What exactly did you change? I didn't suggest changing no to yes. What does stopped working mean?
Did it compile? Did it crash? Did it exit cleanly?

Giulini

full member

Activity: 192

Merit: 100

Tried it out to delete the sourcecode line, then "SW built for SSE2..........NO." change to "YES", but the miner stops working immediately; tried also several -march versions

Quote from: Giulini on April 21, 2016, 02:48:43 PM

same here with: Sempron145 CPU, configure and make with no errors

Quote from: th3.r00t on April 21, 2016, 02:40:08 PM

Code:

         **********  cpuminer-opt 3.1.16  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI extension.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0 and Jeff Garzik.

Checking CPU capatibility...
        AMD Phenom(tm) II X4 940 Processor
   CPU arch supports AES_NI...NO.
   CPU arch supports SSE2.....YES.
   SW built for SSE2..........NO.
Incompatible SW build, rebuild with "-march=native"

Why?

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: joblo on April 21, 2016, 03:52:49 PM

Going after the algos would be daunting as each code segment would have to be analyzed individually. Modifying the scanning
engine to process two, or more, nonces in parallel might give bigger gains at lower effort.

How does ccminer do it in cuda?

It's pretty basic.
There is a for loop with step = number of threads.
Just divide the number of threads by the nonces per thread when launching the kernel, and make each thread process more nonces.
You can even do it all in the algo-specific file (I did it for decred), without touching the main code.
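
A minimal sketch of that idea in plain C rather than CUDA. Everything named here (`toy_hash`, `scan_nonces`, `NONCES_PER_ITER`) is hypothetical and is not taken from ccminer or cpuminer-opt, and the toy mixing function only stands in for a real algo. The point is the loop shape: the scan steps by the batch size and each pass handles several nonces.

```c
#include <stdint.h>
#include <stddef.h>

#define NONCES_PER_ITER 4   /* hypothetical batch size */

/* Toy stand-in for a real hash: NOT a cryptographic function. */
static uint32_t toy_hash(uint32_t nonce)
{
    uint32_t h = nonce * 2654435761u;
    h ^= h >> 15;
    h += 0x9e3779b9u;
    h ^= h << 7;
    return h;
}

/* Scan nonces first .. first+count-1. Nonces whose hash is below target are
 * stored in found[]; returns how many were stored (at most max_found). */
static size_t scan_nonces(uint32_t first, uint32_t count, uint32_t target,
                          uint32_t *found, size_t max_found)
{
    size_t n_found = 0;
    uint32_t done = 0;

    /* Main loop: step = nonces per pass, one batch per iteration. */
    for (; done + NONCES_PER_ITER <= count; done += NONCES_PER_ITER) {
        uint32_t h[NONCES_PER_ITER];

        /* All hashes of the batch are computed together; this is the part
         * a batched or packed hash routine would replace. */
        for (int lane = 0; lane < NONCES_PER_ITER; lane++)
            h[lane] = toy_hash(first + done + lane);

        for (int lane = 0; lane < NONCES_PER_ITER; lane++)
            if (h[lane] < target && n_found < max_found)
                found[n_found++] = first + done + lane;
    }

    /* Tail: leftover nonces when count is not a multiple of the batch. */
    for (; done < count; done++)
        if (toy_hash(first + done) < target && n_found < max_found)
            found[n_found++] = first + done;

    return n_found;
}
```

The results are the same as hashing one nonce at a time; only the order of work changes, which is what leaves room for a batched hash function later.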

AlexGR

legendary

Activity: 1708

Merit: 1049

Useful link for replacing slow & obsolete implementations: http://bench.cr.yp.to/primitives-hash.html

Perhaps if one googles algo by algo, one can find even better ones (?).

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: th3.r00t on April 21, 2016, 05:27:59 PM

Quote from: joblo

What this message is supposed to mean is although the CPU supports SSE2 it wasn't compiled in.
This should only occur if you specify an arch that doesn't support SSE2. Which arch did you use?

Same as always:

Code:

./autogen.sh && ./configure CFLAGS="-O3 -march=btver1" --with-curl --with-crypto && make

I am 100% sure that btver1 includes SSE2

Quote from: joblo

It looks like AMD is going to be a challenge. As AMD users I'll leave it up to you guys to figure out the workarounds
for -march for various CPUs. I can then add notes to the README

The same command line has worked on AMD since cpuminer-opt-3.1.9 and now doesn't on cpuminer-opt-3.1.16.
The last version that worked was cpuminer-opt-3.1.15, so something changed between them.

Also, the compile output is really short, even on an Intel CPU, in cpuminer-opt-3.1.16.

I suspect the SSE2 SW check isn't working. Did you try to override it? If it works with the override I'll remove the check
permanently.

Edit: The override should work because the compile succeeded. If the compiler was truly compiling a non-SSE2 arch
it would have failed on the SSE2 instructions. It would seem the __SSE2__ compiler macro is unreliable. I may remove
the check completely or make it non-fatal.
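
A minimal sketch of the two checks being compared, assuming GCC or Clang on x86. The function names (`sw_built_for_sse2`, `cpu_supports_sse2`, `check_sse2_nonfatal`) are hypothetical and are not the ones used in cpu-miner.c. `__SSE2__` is the compiler's predefined macro, and `__builtin_cpu_supports` is a GCC/Clang builtin that asks the CPU at run time. The last function shows what a non-fatal version of the check could look like.

```c
#include <stdio.h>

/* 1 if the compiler targeted SSE2 for this file, else 0.
 * GCC and Clang define __SSE2__ when SSE2 code generation is enabled. */
static int sw_built_for_sse2(void)
{
#if defined(__SSE2__)
    return 1;
#else
    return 0;
#endif
}

/* 1 if the running CPU reports SSE2, 0 if it does not,
 * -1 if this build has no way to ask (not GCC/Clang on x86). */
static int cpu_supports_sse2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? 1 : 0;
#else
    return -1;
#endif
}

/* Non-fatal variant of the startup check: warn instead of exiting.
 * Returns 1 when the build and the CPU both have SSE2, 0 otherwise. */
static int check_sse2_nonfatal(void)
{
    int sw = sw_built_for_sse2();
    int hw = cpu_supports_sse2();

    printf("   CPU arch supports SSE2.....%s\n",
           hw == 1 ? "YES" : hw == 0 ? "NO" : "UNKNOWN");
    printf("   SW built for SSE2..........%s\n", sw ? "YES" : "NO");

    if (hw == 1 && !sw)
        printf("Warning: CPU has SSE2 but the build does not use it, continuing\n");

    return sw == 1 && hw == 1;
}
```

Comparing the two results on an affected AMD box would show whether the macro or the run-time detection is the part that disagrees.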

th3.r00t

sr. member

Activity: 312

Merit: 250

Quote from: joblo

What this message is supposed to mean is although the CPU supports SSE2 it wasn't compiled in.
This should only occur if you specify an arch that doesn't support SSE2. Which arch did you use?

Same as always:

Code:

./autogen.sh && ./configure CFLAGS="-O3 -march=btver1" --with-curl --with-crypto && make

I am 100% sure that btver1 includes SSE2

Quote from: joblo

It looks like AMD is going to be a challenge. As AMD users I'll leave it up to you guys to figure out the workarounds
for -march for various CPUs. I can then add notes to the README

The same command line has worked on AMD since cpuminer-opt-3.1.9 and now doesn't on cpuminer-opt-3.1.16.
The last version that worked was cpuminer-opt-3.1.15, so something changed between them.

Also, the compile output is really short, even on an Intel CPU, in cpuminer-opt-3.1.16.

AlexGR

legendary

Activity: 1708

Merit: 1049

No idea, haven't looked into CUDA mining.

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: AlexGR on April 21, 2016, 03:06:33 PM

Quote from: joblo on April 21, 2016, 02:23:33 PM

Quote from: AlexGR on April 21, 2016, 01:48:56 PM

May I propose a different approach for much faster mining?

Currently most, if not all, CPU-mineable coins are cripple-mined.

The reason is simple: Under-utilizing of the SIMD nature of SSE & AVX sets.

SSE and AVX commands are used in SISD fashion (single instruction single data, instead of Multiple data / SIMD), meaning they are not processing 2 batches of information but one.

Right now hashing goes on like that:

The main mining routine sends one output to each hash, where it is subjected to a series of SERIAL transmutations / permutations, and in the end the hash outputs that data back to the miner (sometimes to send it on to the next hash).

This serial process doesn't allow for much Single Instruction Multiple Data utilization.

What should be done instead is that the miner program should issue 2-4 hash candidates to the hashing routines. The hashing routines should be able to take 2-4 inputs (instead of 1) and return 2-4 outputs. In this way the process would be parallelized and SIMD utilization (packed processing of similar instructions) would result in much faster processing.

Now this might require a lot of recoding, or, one could adjust the code in C for use with a special compiler which runs multiple instances of serial data crunching in order to process them in "packs" with SIMD or "packed" instructions - and then let the compiler do all the packing. Performance benefits of such an approach here: http://ispc.github.io/perf.html

That's a fascinating idea but I don't think it will get the visibility here that it deserves. Pooler and TPruvot are the two main guys for
CPU mining although TPruvot is focussed more on other projects at the moment. Both have active threads in this forum. I suggest you
present your idea to them in case they, or their followers, may want to take on the challenge. It's beyond my skill level.

It's ok, don't worry. Some people reading this thread will know what to do with it.

I'm not in altcoin mining really as I don't have the hardware and I'm not in the mood to rent. Obviously there's a lot of money here for optimized miners that achieve several times the hashrate of ordinary ones. But this idea also extends to scaling of bitcoin and altcoins for things like cryptographic verification etc. They are using serial functionality when it could be done in packs of 2 or 4 (or 8 in something like ...AVX3-4-5 - or AVX512 which already exists).

Going after the algos would be daunting as each code segment would have to be analyzed individually. Modifying the scanning
engine to process two, or more, nonces in parallel might give bigger gains at lower effort.

How does ccminer do it in cuda?

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: th3.r00t on April 21, 2016, 02:40:08 PM

Code:

         **********  cpuminer-opt 3.1.16  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI extension.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0 and Jeff Garzik.

Checking CPU capatibility...
        AMD Phenom(tm) II X4 940 Processor
   CPU arch supports AES_NI...NO.
   CPU arch supports SSE2.....YES.
   SW built for SSE2..........NO.
Incompatible SW build, rebuild with "-march=native"

Why?

What this message is supposed to mean is although the CPU supports SSE2 it wasn't compiled in.
This should only occur if you specify an arch that doesn't support SSE2. Which arch did you use?

You can override the error by commenting out the exit statement below and recompiling. Note that
cpuminer may crash with the override if the message was correct.

cpu-miner.c function check_cpu_capability line#2700

Code:

         // make sure CPU has at least SSE2
         printf("   CPU arch supports SSE2.....");
         if ( cpu_has_sse2 )
         {
            printf("%s\n", grn_yes );
            printf("   SW built for SSE2..........");
            if ( sw_has_sse2 && !sw_has_aes )
            {
                printf("%s\n", grn_yes );
                printf_mine_without_aes();
            }
            else
            {
                printf("%s\n", ylw_no );
                printf_bad_build();
                exit(1);                            // <-------- delete or comment out this line

It looks like AMD is going to be a challenge. As AMD users I'll leave it up to you guys to figure out the workarounds
for -march for various CPUs. I can then add notes to the README

AlexGR

legendary

Activity: 1708

Merit: 1049

Quote from: pallas on April 21, 2016, 02:37:59 PM

It's not a new idea. It was used back in the GPU bitcoin mining days to get better speed on AMD VLIW cards.
It's easy to adapt the miner itself to process multiple nonces per thread; I'm not sure how much work is needed on the algos themselves. Maybe we could run a test with a simple algo like blake. But I'm not the man for it because I'm not proficient in those CPU instruction extensions.

Neither am I, but it's not that difficult.

Say for example you have a loop like:

for (i = 0; i < 100000000; i++) {
   b = sqrt(b);
   bb = sqrt(bb);
   bbb = sqrt(bbb);
   bbbb = sqrt(bbbb);
}

...gcc will make it something like:

40072e: 0f 84 9b 00 00 00 je 4007cf
  400734: f2 0f 51 d6 sqrtsd %xmm6,%xmm2
  400738: 66 0f 2e d2 ucomisd %xmm2,%xmm2
  40073c: 0f 8a 63 02 00 00 jp 4009a5
  400742: 66 0f 28 f2 movapd %xmm2,%xmm6
  400746: f2 0f 51 cd sqrtsd %xmm5,%xmm1
  40074a: 66 0f 2e c9 ucomisd %xmm1,%xmm1
  40074e: 0f 8a d9 01 00 00 jp 40092d
  400754: 66 0f 28 e9 movapd %xmm1,%xmm5
  400758: f2 0f 51 c7 sqrtsd %xmm7,%xmm0
  40075c: 66 0f 2e c0 ucomisd %xmm0,%xmm0
  400760: 0f 8a 47 01 00 00 jp 4008ad
  400766: 66 0f 28 f8 movapd %xmm0,%xmm7
  40076a: f2 0f 51 c3 sqrtsd %xmm3,%xmm0
  40076e: 66 0f 2e c0 ucomisd %xmm0,%xmm0
  400772: 0f 8a b5 00 00 00 jp 40082d

...which is sqrt-scalar-double.

4 instructions / 4 math operations.

What could be done differently (intel syntax follows):

   movlpd xmm1, b //loading the first variable "b" to the lower part of xmm1
   movhpd xmm1, bb //loading the second variable "bb" to the higher part of xmm1
   SQRTPD xmm1, xmm1 //batch processing both variables for their square root, with one SIMD command
   movlpd xmm2, bbb //loading the third variable "bbb" to the lower part of xmm2
   movhpd xmm2, bbbb //loading the fourth variable "bbbb" to the higher part of xmm2
   SQRTPD xmm2, xmm2 //batch processing their square roots
   movlpd b, xmm1 //
   movhpd bb, xmm1 // Returning all results from the registers back to memory
   movlpd bbb, xmm2 //
   movhpd bbbb, xmm2 //

SQRTPD - Square root - P(acked)-Double.
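
The same packing can be written in C with SSE2 intrinsics instead of hand-written asm. This is only a sketch: the function name `sqrt4_packed` is made up for illustration, while `_mm_loadu_pd`, `_mm_sqrt_pd` and `_mm_storeu_pd` are the standard intrinsics from `<emmintrin.h>`, and `_mm_sqrt_pd` maps to SQRTPD. A scalar fallback is kept for builds where the compiler does not define `__SSE2__`.

```c
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: __m128d, _mm_sqrt_pd */
#else
#include <math.h>        /* scalar fallback only */
#endif

/* Replace each of the four doubles with its square root, two at a time. */
static void sqrt4_packed(double v[4])
{
#if defined(__SSE2__)
    __m128d lo = _mm_loadu_pd(&v[0]);   /* loads v[0] and v[1] */
    __m128d hi = _mm_loadu_pd(&v[2]);   /* loads v[2] and v[3] */
    lo = _mm_sqrt_pd(lo);               /* one packed sqrt for two values */
    hi = _mm_sqrt_pd(hi);
    _mm_storeu_pd(&v[0], lo);
    _mm_storeu_pd(&v[2], hi);
#else
    /* Build without SSE2: fall back to four scalar square roots. */
    for (int i = 0; i < 4; i++)
        v[i] = sqrt(v[i]);
#endif
}
```

Calling it inside the loop in place of the four separate sqrt lines gives the two-instructions-per-pass shape of the asm above.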

So now 4 maths instructions became 2 and the time dropped by about half (I've actually benchmarked the above and it comes close to half). But in order to pack instructions (math or logical) you need similar operations with a similar processing load. You can't have that in a scenario where it goes like

sqrt
add
shift
xor

and the function is changing...

But if you loaded 4x hashes together, you'd be looking at

sqrt(of the first) sqrt (of the second) sqrt (third) sqrt (fourth) (<=pack them)
add add add add (<=pack them)
shift shift shift shift (<=pack them)
xor xor xor xor (<=pack them)
...etc
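
A rough sketch of what "load 4 hashes together" could look like for integer ops, assuming SSE2. `toy_round` and `toy_round_x4` are invented for illustration and are not a real hash round; the intrinsics (`_mm_add_epi32`, `_mm_slli_epi32`, `_mm_srli_epi32`, `_mm_xor_si128`) are standard SSE2. Each 32-bit lane of the register holds the state of one independent hash candidate, so every add, shift and xor is done for all four at once.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Toy round on one 32-bit state: add, shift, xor. NOT a real hash. */
static uint32_t toy_round(uint32_t x, uint32_t k)
{
    x += k;
    x ^= x << 5;
    x ^= x >> 11;
    return x;
}

/* The same round applied to four independent states in one pass. */
static void toy_round_x4(uint32_t x[4], uint32_t k)
{
#if defined(__SSE2__)
    __m128i v  = _mm_loadu_si128((const __m128i *)x);
    __m128i kk = _mm_set1_epi32((int)k);           /* k copied to all lanes */

    v = _mm_add_epi32(v, kk);                      /* add add add add       */
    v = _mm_xor_si128(v, _mm_slli_epi32(v, 5));    /* shift left, xor, x4   */
    v = _mm_xor_si128(v, _mm_srli_epi32(v, 11));   /* shift right, xor, x4  */

    _mm_storeu_si128((__m128i *)x, v);
#else
    /* Build without SSE2: run the scalar round four times. */
    for (int i = 0; i < 4; i++)
        x[i] = toy_round(x[i], k);
#endif
}
```

The packed version gives the same four results as calling `toy_round` four times, which is an easy way to test a real conversion lane by lane.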

I wasn't even aware of the above until a couple of weeks ago, when I got down to asm level to see what happens and why some Pascal output was slower than C output. Then I ran into http://x86.renejeschke.de as a reference while trying to understand the instructions and what they do, and rewrote some instructions myself, like the packed version above (I thought it was pretty easy really). More recently, I went over the code of the asm hash functions of altcoins and bitcoin, and it was full of serial operations, despite "SSE/AVX use" / "SSE/AVX enhanced". And I'm like WHAT THE F***? This is all crippled.

AlexGR

legendary

Activity: 1708

Merit: 1049

Quote from: joblo on April 21, 2016, 02:23:33 PM

Quote from: AlexGR on April 21, 2016, 01:48:56 PM

May I propose a different approach for much faster mining?

Currently most, if not all, CPU-mineable coins are cripple-mined.

The reason is simple: Under-utilizing of the SIMD nature of SSE & AVX sets.

SSE and AVX commands are used in SISD fashion (single instruction single data, instead of Multiple data / SIMD), meaning they are not processing 2 batches of information but one.

Right now hashing goes on like that:

The main mining routine sends one output to each hash, where it is subjected to a series of SERIAL transmutations / permutations, and in the end the hash outputs that data back to the miner (sometimes to send it on to the next hash).

This serial process doesn't allow for much Single Instruction Multiple Data utilization.

What should be done instead is that the miner program should issue 2-4 hash candidates to the hashing routines. The hashing routines should be able to take 2-4 inputs (instead of 1) and return 2-4 outputs. In this way the process would be parallelized and SIMD utilization (packed processing of similar instructions) would result in much faster processing.

Now this might require a lot of recoding, or, one could adjust the code in C for use with a special compiler which runs multiple instances of serial data crunching in order to process them in "packs" with SIMD or "packed" instructions - and then let the compiler do all the packing. Performance benefits of such an approach here: http://ispc.github.io/perf.html

That's a fascinating idea but I don't think it will get the visibility here that it deserves. Pooler and TPruvot are the two main guys for
CPU mining although TPruvot is focussed more on other projects at the moment. Both have active threads in this forum. I suggest you
present your idea to them in case they, or their followers, may want to take on the challenge. It's beyond my skill level.

It's ok, don't worry. Some people reading this thread will know what to do with it.

I'm not in altcoin mining really as I don't have the hardware and I'm not in the mood to rent. Obviously there's a lot of money here for optimized miners that achieve several times the hashrate of ordinary ones. But this idea also extends to scaling of bitcoin and altcoins for things like cryptographic verification etc. They are using serial functionality when it could be done in packs of 2 or 4 (or 8 in something like ...AVX3-4-5 - or AVX512 which already exists).

Giulini

full member

Activity: 192

Merit: 100

same here with: Sempron145 CPU, configure and make with no errors

Quote from: th3.r00t on April 21, 2016, 02:40:08 PM

Code:

         **********  cpuminer-opt 3.1.16  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI extension.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0 and Jeff Garzik.

Checking CPU capatibility...
        AMD Phenom(tm) II X4 940 Processor
   CPU arch supports AES_NI...NO.
   CPU arch supports SSE2.....YES.
   SW built for SSE2..........NO.
Incompatible SW build, rebuild with "-march=native"

Why?

th3.r00t

sr. member

Activity: 312

Merit: 250

Code:

         **********  cpuminer-opt 3.1.16  ***********
     A CPU miner with multi algo support and optimized for CPUs
     with AES_NI extension.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT
     Forked from TPruvot's cpuminer-multi with credits
     to Lucas Jones, elmad, palmd, djm34, pooler, ig0tik3d,
     Wolf0 and Jeff Garzik.

Checking CPU capatibility...
        AMD Phenom(tm) II X4 940 Processor
   CPU arch supports AES_NI...NO.
   CPU arch supports SSE2.....YES.
   SW built for SSE2..........NO.
Incompatible SW build, rebuild with "-march=native"

Why?

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

It's not a new idea. It was used back in the GPU bitcoin mining days to get better speed on AMD VLIW cards.
It's easy to adapt the miner itself to process multiple nonces per thread; I'm not sure how much work is needed on the algos themselves. Maybe we could run a test with a simple algo like blake. But I'm not the man for it because I'm not proficient in those CPU instruction extensions.

th3.r00t

sr. member

Activity: 312

Merit: 250

Quote from: hmage on April 21, 2016, 08:50:44 AM

Quote from: th3.r00t on April 20, 2016, 02:17:53 PM

Quote from: hmage on April 20, 2016, 01:08:20 PM

Can you run this command on your AMD processors and show me the output?

Code:

gcc -march=native -Q --help=target | fgrep march

Here you are:

Code:

root@beast:~$ gcc -march=native -Q --help=target | fgrep march
  -march=                               amdfam10

Then your build should have worked out of the box, with -march=native. amdfam10 doesn't have AES support and gcc won't define __AES__ macro. Can you try building it with -march=native again?

Quote from: joblo on April 20, 2016, 02:30:05 PM

This is curious. I presume that shows which arch is used by native.

On my skylake I get core2-avx and on my haswell I get corei7-avx.
configure fails with -march=skylake on my skylake.

Yes, this shows which arch gcc uses for -march=native.

In your case skylake is too new for gcc: your gcc 4.8.4 doesn't know about it, so it should choose the closest match with most features enabled. There's no 'core2-avx' in the GCC 4.8.4 manual; maybe you meant 'core-avx2'? core-avx2 defines __AES__ automatically.

https://gcc.gnu.org/onlinedocs/gcc-4.8.4/gcc/i386-and-x86-64-Options.html

Nope...
./build.sh still fails on AMD.
Maybe I should try something else?

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: AlexGR on April 21, 2016, 01:48:56 PM

May I propose a different approach for much faster mining?

Currently most, if not all, CPU-mineable coins are cripple-mined.

The reason is simple: Under-utilizing of the SIMD nature of SSE & AVX sets.

SSE and AVX commands are used in SISD fashion (single instruction single data, instead of Multiple data / SIMD), meaning they are not processing 2 batches of information but one.

Right now hashing goes on like that:

The main mining routine sends one output to each hash, where it is subjected to a series of SERIAL transmutations / permutations, and in the end the hash outputs that data back to the miner (sometimes to send it on to the next hash).

This serial process doesn't allow for much Single Instruction Multiple Data utilization.

What should be done instead is that the miner program should issue 2-4 hash candidates to the hashing routines. The hashing routines should be able to take 2-4 inputs (instead of 1) and return 2-4 outputs. In this way the process would be parallelized and SIMD utilization (packed processing of similar instructions) would result in much faster processing.

Now this might require a lot of recoding, or, one could adjust the code in C for use with a special compiler which runs multiple instances of serial data crunching in order to process them in "packs" with SIMD or "packed" instructions - and then let the compiler do all the packing. Performance benefits of such an approach here: http://ispc.github.io/perf.html

That's a fascinating idea but I don't think it will get the visibility here that it deserves. Pooler and TPruvot are the two main guys for
CPU mining although TPruvot is focussed more on other projects at the moment. Both have active threads in this forum. I suggest you
present your idea to them in case they, or their followers, may want to take on the challenge. It's beyond my skill level.

AlexGR

legendary

Activity: 1708

Merit: 1049

May I propose a different approach for much faster mining?

Currently most, if not all, CPU-mineable coins are cripple-mined.

The reason is simple: Under-utilizing of the SIMD nature of SSE & AVX sets.

SSE and AVX commands are used in SISD fashion (single instruction single data, instead of Multiple data / SIMD), meaning they are not processing 2 batches of information but one.

Right now hashing goes on like that:

The main mining routine sends one output to each hash, where it is subjected to a series of SERIAL transmutations / permutations, and in the end the hash outputs that data back to the miner (sometimes to send it on to the next hash).

This serial process doesn't allow for much Single Instruction Multiple Data utilization.

What should be done instead is that the miner program should issue 2-4 hash candidates to the hashing routines. The hashing routines should be able to take 2-4 inputs (instead of 1) and return 2-4 outputs. In this way the process would be parallelized and SIMD utilization (packed processing of similar instructions) would result in much faster processing.

Now this might require a lot of recoding, or, one could adjust the code in C for use with a special compiler which runs multiple instances of serial data crunching in order to process them in "packs" with SIMD or "packed" instructions - and then let the compiler do all the packing. Performance benefits of such an approach here: http://ispc.github.io/perf.html

joblo

legendary

Activity: 1470

Merit: 1114

Quote from: hmage on April 21, 2016, 08:50:44 AM

Quote from: th3.r00t on April 20, 2016, 02:17:53 PM

Quote from: hmage on April 20, 2016, 01:08:20 PM

Can you run this command on your AMD processors and show me the output?

Code:

gcc -march=native -Q --help=target | fgrep march

Here you are:

Code:

root@beast:~$ gcc -march=native -Q --help=target | fgrep march
  -march=                               amdfam10

Then your build should have worked out of the box, with -march=native. amdfam10 doesn't have AES support and gcc won't define __AES__ macro. Can you try building it with -march=native again?

Quote from: joblo on April 20, 2016, 02:30:05 PM

This is curious. I presume that shows which arch is used by native.

On my skylake I get ~~core2-avx~~ core-avx2 and on my ~~haswell~~ sandy bridge I get corei7-avx.
configure fails with -march=skylake on my skylake.

Yes, this shows which arch gcc uses for -march=native.

In your case skylake is too new for gcc: your gcc 4.8.4 doesn't know about it, so it should choose the closest match with most features enabled. There's no 'core2-avx' in the GCC 4.8.4 manual; maybe you meant 'core-avx2'? core-avx2 defines __AES__ automatically.

https://gcc.gnu.org/onlinedocs/gcc-4.8.4/gcc/i386-and-x86-64-Options.html

Hmage, you're a good teacher and you know your stuff, I'm learning a lot. Core-avx2 correction noted.

hmage

member

Activity: 83

Merit: 10

Quote from: th3.r00t on April 20, 2016, 02:17:53 PM

Quote from: hmage on April 20, 2016, 01:08:20 PM

Can you run this command on your AMD processors and show me the output?

Code:

gcc -march=native -Q --help=target | fgrep march

Here you are:

Code:

root@beast:~$ gcc -march=native -Q --help=target | fgrep march
  -march=                               amdfam10

Then your build should have worked out of the box, with -march=native. amdfam10 doesn't have AES support and gcc won't define __AES__ macro. Can you try building it with -march=native again?

Quote from: joblo on April 20, 2016, 02:30:05 PM

This is curious. I presume that shows which arch is used by native.

On my skylake I get core2-avx and on my haswell I get corei7-avx.
configure fails with -march=skylake on my skylake.

Yes, this shows which arch gcc uses for -march=native.

In your case skylake is too new for gcc: your gcc 4.8.4 doesn't know about it, so it should choose the closest match with most features enabled. There's no 'core2-avx' in the GCC 4.8.4 manual; maybe you meant 'core-avx2'? core-avx2 defines __AES__ automatically.

https://gcc.gnu.org/onlinedocs/gcc-4.8.4/gcc/i386-and-x86-64-Options.html
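
One way to see which feature macros a given -march value turns on is to compile a small helper with the same CFLAGS and look at what it reports. This is a sketch only: `add_feature` and `build_features` are made-up names, and the macros tested (`__SSE2__`, `__AES__`, `__AVX__`, `__AVX2__`) are the ones GCC predefines when the matching instruction set is enabled.

```c
#include <string.h>

/* Append name and a trailing space to buf (capacity len), if it fits. */
static void add_feature(char *buf, size_t len, const char *name)
{
    if (strlen(buf) + strlen(name) + 2 <= len) {
        strcat(buf, name);
        strcat(buf, " ");
    }
}

/* Fill buf with the x86 feature macros the compiler defined for this build,
 * for example "SSE2 AES AVX AVX2 " when all four are enabled. */
static void build_features(char *buf, size_t len)
{
    if (len == 0)
        return;
    buf[0] = '\0';
    (void)add_feature;   /* silences the unused warning if no macro is set */
#if defined(__SSE2__)
    add_feature(buf, len, "SSE2");
#endif
#if defined(__AES__)
    add_feature(buf, len, "AES");
#endif
#if defined(__AVX__)
    add_feature(buf, len, "AVX");
#endif
#if defined(__AVX2__)
    add_feature(buf, len, "AVX2");
#endif
}
```

Printing the buffer from a build made with -march=native, and again with an explicit arch such as amdfam10 or btver1, would show directly whether `__SSE2__` and `__AES__` come out as expected.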
