Topic: An (even more) optimized version of cpuminer (pooler's cpuminer, CPU-only)

member
Activity: 154
Merit: 11
Hi! Just for fun (though I might get lucky and mine a BTC block) I'd like to try BTC mining with this software.
What command-line options should I use for solo mining? Thanks!

I've tried this command:
minerd.exe --algo=sha256d --coinbase-addr=myBTCwalletAddressHere --url=localhost

The software displays the error:
HTTP request failed: failed to connect to localhost port 80: connection refused

Why should I have an HTTP server on my localhost to run the program?
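(For reference: solo mining via getblocktemplate needs a local bitcoind running with its JSON-RPC server enabled; the port-80 error just means no port was given in --url, so the default HTTP port was tried. A sketch of the likely intended invocation, assuming bitcoind's default RPC port 8332 and placeholder credentials:)
Code:
minerd.exe --algo=sha256d --coinbase-addr=myBTCwalletAddressHere --url=http://127.0.0.1:8332 --user=rpcuser --pass=rpcpassword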
newbie
Activity: 1
Merit: 0
Hi guys, Denis here. I am trying to mine BANK. I followed the instructions on the pool site: created an account, created a worker, downloaded MinerD (PoolMiner 2.5.0), and created a "RunMe.bat" file. I copied "minerd -a scrypt -t 6 -s 4 -o stratum + tcp: //139.162.40.230: 3335 -u Weblogin.WorkerName -p WorkerPassword" into the "RunMe" file and changed the weblogin, worker name and password. When I run the file, a black screen comes up for a millisecond and disappears. I can't find any videos or instructions on how to fix this. Please help!

Thank you.  Error:
minerd: unsupported non-option argument -- '+'

What was the name of the mining pool he used here: "minerd -a scrypt -t 6 -s 4 -o stratum + tcp: //139.162.40.230: 3335 -u Weblogin.WorkerName -p WorkerPassword"?
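(The error comes from the spaces in the URL: the shell splits "stratum + tcp: //..." into separate arguments, so minerd sees a bare '+'. A sketch of the corrected line, same pool address and placeholders:)
Code:
minerd -a scrypt -t 6 -s 4 -o stratum+tcp://139.162.40.230:3335 -u Weblogin.WorkerName -p WorkerPassword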
newbie
Activity: 4
Merit: 0

There are many other threads discussing this already (watch out for the shills) or you can start your own if none of the existing ones suits you.
The miner threads are for more technical mining issues, including newbie fat finger issues ;).

Mining advice is something I don't discuss, although many others do, because it's a matter of opinion and everyone has their own.

I see, well, I'll search then. I still need to read a lot, but the minimum payout on most pools is out of reach for those of us with old machines.
If a CPU can submit up to ~1k shares on the scrypt algo, a cheaper token based on LTC could make it all worthwhile, at least with the option to actually withdraw, instead of watching litoshis sit stuck in a pool because I'll never reach the minimum payout amount, you know what I mean?
There are tokens like TRX (an ERC-20 token where 1 TRX isn't worth much); imagine a low-price token based on Litecoin, cheap like DOGE or TRX.
The thing is, there are no pools for that.
Well, it's just an idea. I'll stop posting here then,
since as you said this thread is for technical miner issues.

laters! thanks for the reply!

*edit*

I switched back to Binance Pool, since the small amount I mined yesterday has appeared in my funding account, while earning on litecoinpool would take forever.
It's not much, but I'd rather get what I've mined instead of watching it in vain.
full member
Activity: 1372
Merit: 216

No, it's not an advice column; I'm just looking for recommendations, since I'm using the software some of you coded, that's all: minerd and cpuminer.
So I'm currently mining with the scrypt algo.
I just wanted to know if there is any other low-price coin or token that could be mined with a CPU, but it seems there is nothing to be done.

There are many other threads discussing this already (watch out for the shills) or you can start your own if none of the existing ones suits you.
The miner threads are for more technical mining issues, including newbie fat finger issues ;).

Mining advice is something I don't discuss, although many others do, because it's a matter of opinion and everyone has their own.
newbie
Activity: 4
Merit: 0
This isn't an advice column. It's about mining sha256d and scrypt with cpuminer, neither of which has been profitable to mine with
a CPU for 10 years. One piece of advice: don't mine anything with a laptop.

Thanks for the reply, Jay.
No, it's not an advice column; I'm just looking for recommendations, since I'm using the software some of you coded, that's all: minerd and cpuminer.
So I'm currently mining with the scrypt algo.
I just wanted to know if there is any other low-price coin or token that could be mined with a CPU, but it seems there is nothing to be done.


https://i.ibb.co/kKWr1xg/Screenshot-2023-11-05-133849.png

Actually, that is one of the things that sometimes concerns me:
a CPU is capable of mining and it does the job; of course it won't make a huge amount of LTC,
but maybe what is needed is a new token.

I still need to study and learn some things; I have been away for years.
Well, I won't make this long.

laters
full member
Activity: 1372
Merit: 216
hello!

I have an Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz;
it actually has 4 physical cores and 8 logical.

I'm currently mining scrypt at Binance Pool.
In the past I used to mine XMR, but the profit was very low.

Should I try mining another algo with this old L845 from 2014?

I need to make some profit, like using my CPU and purchasing some cloud boost or something.

Any advice?


I switched to litecoinpool since my worker hasn't been showing in Binance Pool for hours.


This isn't an advice column. It's about mining sha256d and scrypt with cpuminer, neither of which has been profitable to mine with
a CPU for 10 years. One piece of advice: don't mine anything with a laptop.
newbie
Activity: 4
Merit: 0
hello!

I have an Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz;
it actually has 4 physical cores and 8 logical.

I'm currently mining scrypt at Binance Pool.
In the past I used to mine XMR, but the profit was very low.

Should I try mining another algo with this old L845 from 2014?

I need to make some profit, like using my CPU and purchasing some cloud boost or something.

Any advice?


I switched to litecoinpool since my worker hasn't been showing in Binance Pool for hours.
full member
Activity: 1372
Merit: 216

blurb:
My guess is that GPUs, and maybe Intel and AMD, have a different hardware architecture that can drastically speed up loads and stores to / from the vector registers.
The Raspberry Pi's Broadcom (Cortex-A72) CPU uses external DRAM with a 16-bit bus; there is little that can be done about it, as the board needs to optimise for cheapness, space and ease of manufacturing.
I think some GPUs and 'high end' CPUs have 64-bit or even 128-bit or wider data buses, which can load/store two or more 64-bit words in a single cycle.
With just a 16-bit DRAM data bus, even a single 64-bit word takes 4 cycles just to fetch or store, and one needs to add all the memory wait states, which means more cycles.
I avoided that in my 'hand coded assembly' by simply using CPU registers, but that did not overcome the stalls writing to / reading from the NEON registers.
For comparison: without optimization (i.e. no -O2 and/or -ftree-vectorize flags), simply using the 'rearranged arrays' Salsa that looks like lanes, so that it writes to and reads from memory, the Raspberry Pi takes > 1 second for that same 1048576 rounds.
All the optimizations above are then hundreds of times faster than the unoptimised version, and that is true even of the naive NEON SIMD.

Good stuff.

If you're just playing around to learn, I suggest you take a look at the Blake family of algorithms; it doesn't have the memory issues that Salsa does. It also supports
both linear and parallel vector coding optimizations, even using both together. Linear typically requires some cross-laning but doesn't increase memory usage; parallel
doesn't require cross-laning, but memory requirements scale with the number of parallel data lanes and it is very sensitive to data-dependent addressing.
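(To illustrate the parallel-lane approach: a minimal sketch of a 4-way Blake2s-style G mixing function using NEON intrinsics, where each 128-bit register holds the same state word from four independent hash states, so no cross-laning is needed. The function name and layout are illustrative, not from any posted code.)
Code:
#include <arm_neon.h>

/* rotate each 32-bit lane right by n; NEON has no rotate instruction,
   so use the usual shift-left + shift-right-insert idiom */
#define VROR32(x, n) vsriq_n_u32(vshlq_n_u32(x, 32 - (n)), x, n)

/* Blake2s-style G on four independent states at once: a,b,c,d each hold
   one state word from 4 hash states (one per lane); mx,my are the
   corresponding message words for each lane */
static inline void g4(uint32x4_t *a, uint32x4_t *b, uint32x4_t *c,
                      uint32x4_t *d, uint32x4_t mx, uint32x4_t my)
{
    *a = vaddq_u32(vaddq_u32(*a, *b), mx);
    *d = VROR32(veorq_u32(*d, *a), 16);
    *c = vaddq_u32(*c, *d);
    *b = VROR32(veorq_u32(*b, *c), 12);
    *a = vaddq_u32(vaddq_u32(*a, *b), my);
    *d = VROR32(veorq_u32(*d, *a), 8);
    *c = vaddq_u32(*c, *d);
    *b = VROR32(veorq_u32(*b, *c), 7);
}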
newbie
Activity: 7
Merit: 0

It looks like you're comparing the original code with -ftree-vectorize to "hand coded" with -ftree-vectorize. That doesn't prove anything about -ftree-vectorize.
You need to test the same code with and without vectorization. Did your hand-coded version actually use parallel SIMD Salsa on the data "arranged in lanes"?


OK, the original driving code is like this:
Code: ("main.c")
#include 
#include
void salsa(uint *X, uint rounds);

int main(int argc, char **argv) {
    uint X[16];
    const int rounds = 1024*1024;
    clock_t start, end;
    double cpu_time_used;

    for(int i=0; i<16; i++)
        X[i] = i;

    start = clock();
    salsa(X,rounds);
    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("cputime %g\n", cpu_time_used);
 
}

/* Salsa20, rounds must be a multiple of 2 */
void __attribute__ ((noinline)) salsa(uint *X, uint rounds) {
    uint x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, t;

    x0 = X[0];   x1 = X[1];   x2 = X[2];   x3 = X[3];
    x4 = X[4];   x5 = X[5];   x6 = X[6];   x7 = X[7];
    x8 = X[8];   x9 = X[9];  x10 = X[10]; x11 = X[11];
   x12 = X[12]; x13 = X[13]; x14 = X[14]; x15 = X[15];

#define quarter(a, b, c, d, v) \
    t = a + d; if (v) printf("t: %d\n",t); \
    t = ROTL32(t,  7); if(v) printf("t: %d\n",t); \
    b ^= t; if(v) printf("b: %d\n",b); \
    t = b + a; if(v) printf("t: %d\n",t); \
    t = ROTL32(t,  9); if(v) printf("t: %d\n",t); \
    c ^= t; if(v) printf("c: %d\n",c); \
    t = c + b; if(v) printf("t: %d\n",t); \
    t = ROTL32(t, 13); if(v) printf("t: %d\n",t); \
    d ^= t; if(v) printf("d: %d\n",d); \
    t = d + c; if(v) printf("t: %d\n",t); \
    t = ROTL32(t, 18); if(v) printf("t: %d\n",t); \
    a ^= t; if(v) printf("a: %d\n",a);

    int v = 0;
    for(; rounds; rounds -= 2) {
        quarter( x0,  x4,  x8, x12, v);
        quarter( x5,  x9, x13,  x1, v);
        quarter(x10, x14,  x2,  x6, v);
        quarter(x15,  x3,  x7, x11, v);
        quarter( x0,  x1,  x2,  x3, v);
        quarter( x5,  x6,  x7,  x4, v);
        quarter(x10, x11,  x8,  x9, v);
        quarter(x15, x12, x13, x14, v);
    }

    X[0] += x0;   X[1] += x1;   X[2] += x2;   X[3] += x3;
    X[4] += x4;   X[5] += x5;   X[6] += x6;   X[7] += x7;
    X[8] += x8;   X[9] += x9;  X[10] += x10; X[11] += x11;
   X[12] += x12; X[13] += x13; X[14] += x14; X[15] += x15;

#undef quarter
}
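(For reference, a build-and-run line like the following reproduces the timing experiments; the gcc invocation and output name are mine, and the flag sets quoted below should be substituted for each measurement:)
Code:
gcc -O2 -ftree-vectorize -ftree-slp-vectorize -ftree-loop-vectorize -ffast-math main.c -o salsa_bench
./salsa_bench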

That will run 1024 × 1024 = 1048576 rounds.
Without optimization (no -O flags):
Code:
cputime 0.187971
cputime 0.231245
cputime 0.187873

~ 202.363 ms for that 1048576 rounds

With -O2 but no -ftree-vectorize:
Code:
cputime 0.011749
cputime 0.011733
cputime 0.025701
~ 16.394 ms for that 1048576 rounds

with
Code:
-O2 -ftree-vectorize -ftree-slp-vectorize -ftree-loop-vectorize -ffast-math -ftree-vectorizer-verbose=7 -funsafe-loop-optimizations -funsafe-math-optimizations
Code:
cputime 0.012641
cputime 0.012635
cputime 0.012638
~ 12.638 ms for that 1048576 rounds
There is only a small amount of NEON SIMD assembly generated by -ftree-vectorize (all the v* instructions):
Code:
        ...
        str     r2, [sp, #164]
        ldr     r2, [sp, #12]
        vldr    d16, [sp, #152]
        vldr    d17, [sp, #160]
        str     r2, [sp, #172]
        ldr     r2, [sp, #64]
        vldr    d20, [sp, #168]
        vldr    d21, [sp, #176]
        str     r2, [sp, #184]
        vadd.i32        q8, q8, q9
        ldr     r2, [sp, #36]
        str     r2, [sp, #188]
        ldr     r2, [sp, #68]
        str     r2, [sp, #192]
        ldr     r2, [sp, #72]
        str     r2, [sp, #196]
        vldr    d18, [sp, #184]
        vldr    d19, [sp, #192]
        str     r7, [sp, #212]
        ldr     r2, [sp]
        str     r2, [sp, #204]
        ldr     r2, [sp, #24]
        vadd.i32        q9, q9, q10
For practical purposes the running speed with -ftree-vectorize is nearly the same as plain -O2; my earlier observations likely resulted from differing cache conditions and possible CPU context switches, i.e. the CPU running other threads.

My 'hand optimised' version (NEON intrinsics) looks like this:
Code:
/* Salsa quarter rounds, 4 in parallel: each instruction operates on
   4 different quarter rounds of the 4x4 matrix. vt and vt1 are
   uint32x4_t temporaries; the vshl + vsra pair implements a 32-bit
   rotate left by SHIFT. */
#define vquarter(VA, VB, VC, SHIFT) \
        vt = vaddq_u32(VB, VA); \
        vt1 = vshlq_n_u32(vt, SHIFT); \
        vt1 = vsraq_n_u32(vt1, vt, 32-SHIFT); \
        VC = veorq_u32(VC, vt1);
But there is a lot of permutation code. This is an excerpt from the assembly generated with gcc's -S option:
Code:
        vld1.32 {d22-d23}, [r5]
        vld1.32 {d24-d25}, [r8]
        vld1.32 {d18-d19}, [r7]
        vld1.32 {d20-d21}, [r6]
        cmp     r4, #0
        beq     .L2
.L3:
...     matrix permutations
        vmov.32 r3, d22[0]
        vmov.32 d8[0], r3
        vmov.32 r3, d24[1]
        vmov.32 d8[1], r3
        vmov.32 r3, d19[0]
        vmov.32 d9[0], r3
        vmov.32 r3, d21[1]
        vmov.32 d9[1], r3
        vmov.32 r3, d24[0]
        vmov.32 d10[0], r3
        vmov.32 r3, d18[1]
        vmov.32 d10[1], r3
        vmov.32 r3, d21[0]
        vmov.32 d11[0], r3
        vmov.32 r3, d23[1]
...
...     // all that 4x parallel quarter rounds generated from the neon intrinsics
        vmov    q7, q10  @ v4si
        vadd.i32        q8, q10, q4
        vshl.i32        q9, q8, #7
        vsra.u32        q9, q8, #25
        veor    q9, q9, q5
        vadd.i32        q10, q9, q4
        vshl.i32        q8, q10, #9
        vsra.u32        q8, q10, #23
        veor    q8, q8, q6
        vadd.i32        q11, q8, q9
        vshl.i32        q10, q11, #13
        vsra.u32        q10, q11, #19
        veor    q10, q10, q7
        vadd.i32        q12, q10, q8
        vshl.i32        q11, q12, #18
        vsra.u32        q11, q12, #14
        veor    q11, q11, q4
... then more matrix permutations
        vmov.32 r3, d20[1]
        vmov    q5, q9  @ v4si
        vmov.32 d10[0], r3
        vmov.32 r3, d21[0]
        vmov.32 d10[1], r3
        vmov.32 r3, d21[1]
        vmov.32 d11[0], r3
        vmov.32 r3, d20[0]
        vmov    q14, q5  @ v4si
        vmov.32 d29[1], r3
        vmov.32 r3, d17[0]
        vmov    q6, q8  @ v4si
        vmov.32 d12[0], r3
        vmov.32 r3, d17[1]
        vmov.32 d12[1], r3
        vmov.32 r3, d16[0]
The above code is nearly 'everything NEON'; all those permutations are a bummer and cannot be parallelised (e.g. a lot of stalls, as different values need to be loaded into a single SIMD register).
Code:
cputime 0.059751
cputime 0.05988
cputime 0.059885
~ 59.8387 ms for that 1048576 rounds

It is likely that there are 'better ways to do NEON', but gcc's -O2 and -ftree-vectorize take less than a minute to type into a script / config file and probably just work, be it on ARM, Intel, AMD or other CPUs.
And the result is roughly 5x faster than this naive NEON implementation.
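(For example, with cpuminer's autotools build the same flags can be passed in one line; the flag set is the one from this thread, the configure usage is just standard autotools:)
Code:
CFLAGS="-O2 -ftree-vectorize -ftree-slp-vectorize -ftree-loop-vectorize -ffast-math" ./configure
make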

blurb:
My guess is that GPUs, and maybe Intel and AMD, have a different hardware architecture that can drastically speed up loads and stores to / from the vector registers.
The Raspberry Pi's Broadcom (Cortex-A72) CPU uses external DRAM with a 16-bit bus; there is little that can be done about it, as the board needs to optimise for cheapness, space and ease of manufacturing.
I think some GPUs and 'high end' CPUs have 64-bit or even 128-bit or wider data buses, which can load/store two or more 64-bit words in a single cycle.
With just a 16-bit DRAM data bus, even a single 64-bit word takes 4 cycles just to fetch or store, and one needs to add all the memory wait states, which means more cycles.
I avoided that in my 'hand coded assembly' by simply using CPU registers, but that did not overcome the stalls writing to / reading from the NEON registers.
For comparison: without optimization (i.e. no -O2 and/or -ftree-vectorize flags), simply using the 'rearranged arrays' Salsa that looks like lanes, so that it writes to and reads from memory, the Raspberry Pi takes > 1 second for that same 1048576 rounds.
All the optimizations above are then hundreds of times faster than the unoptimised version, and that is true even of the naive NEON SIMD.
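(For the curious, 'rearranged into lanes' means grouping the state words so the four column quarter rounds line up in SIMD lanes; a hypothetical sketch, using the quarter-round pattern from main.c above, not the actual experiment's code:)
Code:
#include <arm_neon.h>

/* load the Salsa20 state X[16] into 'lane' order for the column rounds:
   quarter(x0,x4,x8,x12), quarter(x5,x9,x13,x1),
   quarter(x10,x14,x2,x6), quarter(x15,x3,x7,x11) */
static void load_column_lanes(const unsigned X[16],
                              uint32x4_t *va, uint32x4_t *vb,
                              uint32x4_t *vc, uint32x4_t *vd)
{
    *va = (uint32x4_t){ X[0],  X[5],  X[10], X[15] }; /* 'a' operands */
    *vb = (uint32x4_t){ X[4],  X[9],  X[14], X[3]  }; /* 'b' operands */
    *vc = (uint32x4_t){ X[8],  X[13], X[2],  X[7]  }; /* 'c' operands */
    *vd = (uint32x4_t){ X[12], X[1],  X[6],  X[11] }; /* 'd' operands */
    /* one vquarter-style chain now advances all 4 column rounds at once,
       but the row rounds quarter(x0,x1,x2,x3) etc. need a different
       grouping, which is where all the vmov shuffling comes from */
}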
full member
Activity: 1372
Merit: 216
I did one more experiment doing Salsa20 with NEON intrinsics
https://arm-software.github.io/acle/neon_intrinsics/advsimd.html
'hand optimized' but naive.
I ran Salsa20 for a million loops, and it turns out gcc's -ftree-vectorize did 1 million loops of the original C code in 11 ms.

The arrays rearranged into 'lanes' with -ftree-vectorize did 1 million loops in 80 ms (it varies depending on the cache; the 2nd run tends to be faster).
This version writes the arrays out to memory during permutation: lots of wait states and stalls.

The naive 'hand optimized' Salsa20 with NEON takes 59 ms for 1 million loops.

This kind of means it isn't true that -ftree-vectorize is slow, in that it sometimes beats 'naive' hand-optimized code. It is probably close to the best optimized code, but the generated assembly is practically unreadable: generated by machines, for machines.

It looks like you're comparing the original code with -ftree-vectorize to "hand coded" with -ftree-vectorize. That doesn't prove anything about -ftree-vectorize.
You need to test the same code with and without vectorization. Did your hand-coded version actually use parallel SIMD Salsa on the data "arranged in lanes"?


newbie
Activity: 7
Merit: 0
I did one more experiment doing Salsa20 with NEON intrinsics
https://arm-software.github.io/acle/neon_intrinsics/advsimd.html
'hand optimized' but naive.
I ran Salsa20 for a million loops, and it turns out gcc's -ftree-vectorize did 1 million loops of the original C code in 11 ms.

The arrays rearranged into 'lanes' with -ftree-vectorize did 1 million loops in 80 ms (it varies depending on the cache; the 2nd run tends to be faster).
This version writes the arrays out to memory during permutation: lots of wait states and stalls.

The naive 'hand optimized' Salsa20 with NEON takes 59 ms for 1 million loops.

This kind of means it isn't true that -ftree-vectorize is slow, in that it sometimes beats 'naive' hand-optimized code. It is probably close to the best optimized code, but the generated assembly is practically unreadable: generated by machines, for machines.

full member
Activity: 1372
Merit: 216
JayDDee,
Thanks for your comments. I'll leave it at that for now, as many others will be reading this thread.

My pleasure.

Quote
off-topic:
I'd also think that using features like NEON SIMD has severe implications on high-core-count ARM chips, e.g. those 'Ampere' chips

That is correct. SIMD reduces the number of instructions but increases the load of each one, like doubling or quadrupling the load on a truck.
It's more efficient but you need a strong truck. SIMD also increases the load on the memory system to try to keep the CPU fed. The end result
is more heat everywhere.
newbie
Activity: 7
Merit: 0
JayDDee,
Thanks for your comments. I'll leave it at that for now, as many others will be reading this thread.
Rather, it is correct to say that -ftree-vectorize is not a 'miracle pill'; in some cases one may observe an edge, but not much of one, and I think the 20% gain was not rigorously measured,
i.e. expect smaller gains in practice from -ftree-vectorize. Given my experiments, there are likely 'some' gains, but not a lot.

I think it may be a mixed blessing to document or use -ftree-vectorize. Every miner wishes to 'go faster', but as I've mentioned, ARM CPUs are not quite like their high-end Intel and AMD counterparts.
I think there are A53 ARM CPUs that may not have the NEON SIMD blocks, especially the 'cheap' ones, and running NEON-optimised code on them would likely either just abort the app or produce incorrect results. It may also be very hard to catch: if the 'NEON optimised' code is treated as a nop (no-op), one could be mining for hours without getting a single share, thinking the work is too hard, when really the A53 chip just doesn't have NEON.
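(On Linux, NEON support can at least be checked at runtime via the hwcap auxiliary vector, so a miner could bail out cleanly instead of crashing; a minimal sketch using the standard getauxval interface:)
Code:
#include <stdio.h>
#include <sys/auxv.h>
#include <asm/hwcap.h>   /* HWCAP_* bits, Linux-specific */

int main(void) {
#if defined(__aarch64__)
    /* Advanced SIMD is effectively mandatory on AArch64, but the bit exists */
    int ok = (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
#else
    int ok = (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;   /* 32-bit ARM */
#endif
    printf("NEON/AdvSIMD: %s\n", ok ? "yes" : "no");
    return 0;
}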

The note is for developers: -ftree-vectorize may be an 'easy' way to get 'some SIMD' without working through the difficult job of hand-optimising assembly.
It also makes the code more portable, in the sense that -ftree-vectorize is the same option whether on Intel, AMD, ARM or others with gcc compilers, and it is the same scrypt.c file.

off-topic:
I'd also think that using features like NEON SIMD has severe implications on high-core-count ARM chips, e.g. those 'Ampere' chips
https://amperecomputing.com/products/processors
The heat output of 128 cores all doing NEON SIMD would be so high that it would either destroy the chip / server, or the frequency would need to be severely throttled, say to well under 1 GHz;
or maybe it would need a boiling tub of liquid nitrogen on top of it to run at -170°C.
full member
Activity: 1372
Merit: 216
Hence, for now, the 'easy' way is to simply use -ftree-vectorize with the other flags in the bundle, so that at least some form of NEON SIMD is achieved.
There is a decent gain, like 20% (for NeoScrypt), with vs. without the compiler-generated SIMD code.

Very few of those gains are in the hashing code. There are SIMD versions of Salsa20 available, used in most implementations of scrypt and neoscrypt,
but they are all hand coded. Most of the gains from compiler optimizing are from loops with no dependencies.

Parallel SIMD requires data locality to be efficient, as you observed. CPU memory doesn't do well with scattered memory access.
Neoscrypt, if that is your real interest, doesn't do well with parallel hashing due to Salsa20, so it would be futile to try.
Pooler's scrypt code uses multiple buffering for Salsa20, but it's not parallel; the Salsa20 SIMD is all single-stream.

I tried various forms of parallel scrypt with poor results. The remains of those efforts still exist in my code...
https://github.com/JayDDee/cpuminer-opt/blob/master/algo/scrypt/scrypt-core-4way.c

Edit: I should also add that Salsa20 is a 32-bit hash function, so it doesn't benefit from 64-bit architecture. 64 bits just means you could do 2-way parallel,
which doesn't help any and would still need to be hand coded. The simplest route would be to make the 32-bit ASM code run on 64-bit ARM.
I'm not familiar with ARM architecture, so I don't know how incompatible 32- and 64-bit ASM are. It might require tweaking every instruction; there might be tools
to simplify that.
newbie
Activity: 7
Merit: 0
Thanks JayDDee. There is a minor gain with -ftree-vectorize for NeoScrypt, as it spends a large number of loops in Salsa20 (1000 × some 200 rounds?).
Hence NEON SIMD could potentially speed that up significantly. The trouble is that Salsa20
https://en.wikipedia.org/wiki/Salsa20
permutes the arrays between the quarter rounds in each loop. I made a naive attempt by simply rearranging the arrays in the C code so that they appear to fall into 'lanes' (common to SIMD).
That oversimplified approach doesn't cut it with -ftree-vectorize: the registers get streamed out to memory (lots of wait states and CPU stalls on the small Raspberry Pi type boards and CPUs).
But hand-optimized assembly won't be easy to write and would take quite a lot of effort,
and this won't be the only thing that needs to be optimized.

Hence, for now, the 'easy' way is to simply use -ftree-vectorize with the other flags in the bundle, so that at least some form of NEON SIMD is achieved.
There is a decent gain, like 20% (for NeoScrypt), with vs. without the compiler-generated SIMD code.
full member
Activity: 1372
Merit: 216
Thanks pooler. I'm thinking you may want to add an option for, or document, the flags mentioned. I think we'll leave the challenge of hand-optimizing that part of the code for another time, or for whoever wants to take it up.

By just using the flags mentioned, gcc builds binaries with NEON SIMD via the -ftree-vectorize flag; the other flags are needed along with it, as otherwise the SIMD code isn't turned on.
This is a 'quick and easy' way to at least get some NEON SIMD on aarch64, and it isn't too bad, as I've described.

-ftree-vectorize isn't that useful for hash code. Most of the SIMD gains are from hashing multiple nonces in parallel within a single CPU thread. The compiler can't do that;
it must be hand coded. See the sha256d ASM for an example. All the SIMD code is in support of hashing nonces in parallel, outside the scope of any compiler optimizing.

Scrypt Salsa could theoretically be optimized by the compiler, but only if the compiler were written to recognize the code as Salsa. That would be a stretch. The hand-
written ASM does what the compiler can't.

What's really needed is to write NEON ASM by hand to do parallel hashing. However, that's not worthwhile with most CPUs now supporting HW-accelerated sha256.
Writing ARM 64-bit SIMD for sha256 would be nothing more than an academic exercise.

cpuminer-opt has good examples of parallel sha256 using 128-, 256- & 512-bit x86 SIMD, as well as a HW-accelerated version using intrinsics, which are a little more readable
than ASM.
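(To illustrate 'hashing multiple nonces in parallel within a single CPU thread': the candidate blocks, identical except for the nonce, are interleaved so that each SIMD lane carries one candidate; a hypothetical sketch, not the actual cpuminer-opt code:)
Code:
#include <arm_neon.h>
#include <stdint.h>

/* interleave word i of four 16-word message blocks into one vector, so
   that a SHA-256 round written with uint32x4_t operations advances four
   independent hashes at once; no compiler can discover this on its own */
static void interleave4(uint32x4_t W4[16], const uint32_t blk[4][16])
{
    for (int i = 0; i < 16; i++) {
        uint32x4_t v = { blk[0][i], blk[1][i], blk[2][i], blk[3][i] };
        W4[i] = v;
    }
}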

newbie
Activity: 7
Merit: 0
Thanks pooler. I'm thinking you may want to add an option for, or document, the flags mentioned. I think we'll leave the challenge of hand-optimizing that part of the code for another time, or for whoever wants to take it up.

By just using the flags mentioned, gcc builds binaries with NEON SIMD via the -ftree-vectorize flag; the other flags are needed along with it, as otherwise the SIMD code isn't turned on.
This is a 'quick and easy' way to at least get some NEON SIMD on aarch64, and it isn't too bad, as I've described.

off-topic:
Just to add a note: I tried to 'hand optimize' it by making a C source where I rearranged the arrays in Salsa20 to fall into 'lanes', using the same -ftree-vectorize flags. However, instead of being faster, the original code is optimised better, even though it actually uses less NEON SIMD. Looking closer at the generated NEON code, I think the problem is that simply rearranging the arrays won't cut it: between the iterations/loops the array is permuted, so it gets streamed out to memory (a bummer, I'd think: lots of stalls) and then loaded back from memory into a differently permuted set of registers.
With the original code there is actually less SIMD; it seems -ftree-vectorize and the other optimizations simply used the normal registers for part of the code and passed values into SIMD registers for some sections, and that in itself is faster than the 'rearranged array' code.
hero member
Activity: 838
Merit: 507
recently I tried getting cpuminer to run on a Raspberry Pi 4 (aarch64)
I'm using somewhat older code (version 2.4) from
https://github.com/pooler/cpuminer

I couldn't figure out how to get it to build with the ARM assembly code, and apparently scrypt-arm.S seems to be written for armhf (32-bit ARM microprocessors).
Hence, there could be issues compiling on aarch64 (64-bit ARM instructions and OS).

It's not about issues, it's just plain incompatible. Despite the name similarity, 64-bit ARM (AArch64) is an entirely different architecture than 32-bit ARM, and the hand-written assembly code in cpuminer only targets the latter. I never got around to writing optimized code for AArch64, so all you can do at the moment is compile a C-only version of the miner.
newbie
Activity: 7
Merit: 0
hi pooler,
are you still monitoring this thread?

hi all,
recently I tried getting cpuminer to run on a Raspberry Pi 4 (aarch64)
I'm using somewhat older code (version 2.4) from
https://github.com/pooler/cpuminer

I couldn't figure out how to get it to build with the ARM assembly code, and apparently scrypt-arm.S seems to be written for armhf (32-bit ARM microprocessors).
Hence, there could be issues compiling on aarch64 (64-bit ARM instructions and OS).

Among the things I tried, I added the following flags
Code:
 -O2 -ftree-vectorize -ftree-slp-vectorize -ftree-loop-vectorize -ffast-math -ftree-vectorizer-verbose=7 -funsafe-loop-optimizations -funsafe-math-optimizations

I checked by compiling with the -S option, which makes GCC generate assembly; apparently the suite of flags above causes GCC to generate assembly with NEON SIMD.
This is without any hand-optimized assembly. It may still generate NEON assembly with a few fewer flags (e.g. possibly without -ftree-slp-vectorize, though I think that one is useful even without NEON), but when some of the above flags are missing, NEON assembly isn't generated.

I tried with and without the above flags (e.g. just -O2). There is at least a noticeable difference in hash rates: from about 4 khash/s on all 4 cores doing NeoScrypt (mining Feathercoin) to about 5+ khash/s with the NEON code generated by GCC's -ftree-vectorize, some 20-30% improvement. The CPU also runs hotter during mining along with the higher hash rates, which suggests better utilisation. This is probably a useful thing to have around, as manually writing hand-optimized assembly (e.g. for scrypt-arm.S) would likely take a lot of effort and be less portable. Granted, -ftree-vectorize won't produce the fastest code, but the improvement is decent with much less manual effort than writing optimized assembly.
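(A quick way to confirm the vectorizer kicked in, assuming a 32-bit ARM target where NEON mnemonics start with 'v'; the file name and grep pattern are illustrative:)
Code:
gcc -S -O2 -ftree-vectorize -ftree-slp-vectorize -ftree-loop-vectorize -ffast-math scrypt.c
grep -c -E '^[[:space:]]+v(add|eor|shl|sra|ld|st)' scrypt.s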

Note that NEON code may not work on some ARM CPUs that don't support it; I think I chanced upon a spec saying that on A53 CPUs the SIMD extensions are possibly *optional*.
E.g. it is quite possible that some A53s in the wild, the 'cheap' ones, may not have NEON in them, even though they are A53 CPUs.

It used to be that Raspberry Pis were deemed too slow for mining, but the Raspberry Pi 4 with A72 ARM cores is borderline and punches above its weight, mining alongside the big Mhash/s GPUs, though the difference is easily 1:1000.
hero member
Activity: 838
Merit: 507
I feel that cpuminer is not adding the fees to the block reward.

I am looking at the coinbase transaction generation, the code under the section /* build coinbase transaction */

Code:

      tmp = json_object_get( val, "coinbasevalue" );
      if ( !tmp || !json_is_number( tmp ) )
      {
         applog( LOG_ERR, "JSON invalid coinbasevalue" );
         goto out;
      }
      cbvalue = (int64_t) ( json_is_integer( tmp ) ? json_integer_value( tmp )
                                                   : json_number_value( tmp ) );


I feel it is missing the fee addition; below is the code I was expecting, to compute and add the fees.

[...]

The coinbasevalue field already includes all available transaction fees.
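(The coinbasevalue field comes from getblocktemplate, defined in BIP 22; an illustrative excerpt of such a template, with a made-up value:)
Code:
{
  ...
  "coinbasevalue": 625123456,
  ...
}
The value is in satoshis and is the block subsidy plus the fees of all transactions selected for the block, so no separate fee summation is needed in the miner.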