
Topic: Ufasoft Miner - Windows/Linux, x86/x64, SSE2/OpenCL, Open Source - page 20. (Read 631037 times)

sr. member
Activity: 404
Merit: 251
So, you're telling me that we're not moving 16-byte chunks into the xmm registers from 64-byte cache lines?  Because I see a few places here where we are.  Like here:

Code:
ELSE
movdqa xmm3, [zsp+5*16]
movdqa xmm4, [zsp+6*16]
movdqa xmm5, [zsp+7*16]

paddd xmm3, [zsi+5*16]
paddd xmm4, [zsi+6*16]
paddd xmm5, [zsi+7*16]

movdqa [zbx+5*16], xmm3
movdqa [zbx+6*16], xmm4
movdqa [zbx+7*16], xmm5

These data are already in the L1 cache, so it makes no difference which MOV instruction loads them.

Anyway, if you are a practical programmer, you can profile the patched miner. Anything that is not tested is just speculation.
sr. member
Activity: 378
Merit: 250
Only if you use 256-bit, which should be avoided until AVX2.  But you're only using XMM, not YMM, so none of that comes into play.  And sorry, I'm about to hit the wall.  I've been up all night.  Basically, if you're wanting to use the full YMM registers, you can't do much for integers as of yet.  It's coming.  However, you can use the lower half of the YMM by specifying the XMM equivalent.  The XMM registers can use your ADDs, Shifts, XORs and ANDs via AVX.  So,
Hm, the miner already uses XMM registers (SSE2) in the current implementation.

As for MOVNTDQA, it combines smaller transfers into a single stream via a buffer.  It's mainly useful for moving data into xmm by converting multiple smaller accesses into one 64-byte transfer, which increases the bandwidth approx. 10x.
It is not applicable to SHA256 calculation.


So, you're telling me that we're not moving 16-byte chunks into the xmm registers from 64-byte cache lines?  Because I see a few places here where we are.  Like here:


Code:
ELSE
movdqa xmm3, [zsp+5*16]
movdqa xmm4, [zsp+6*16]
movdqa xmm5, [zsp+7*16]

paddd xmm3, [zsi+5*16]
paddd xmm4, [zsi+6*16]
paddd xmm5, [zsi+7*16]

movdqa [zbx+5*16], xmm3
movdqa [zbx+6*16], xmm4
movdqa [zbx+7*16], xmm5
Changed to:

Code:
ELSE
movntdqa xmm3, [zsp+5*16]  ; First load pulls the whole 64-byte cache line into the streaming load buffer.
movntdqa xmm4, [zsp+6*16]  ; If zsp is 64-byte aligned, this hits the same line and is served from the buffer.
movntdqa xmm5, [zsp+7*16]  ; Same here, so no extra trip to memory is needed.

paddd xmm3, [zsi+5*16]
paddd xmm4, [zsi+6*16]
paddd xmm5, [zsi+7*16]

movdqa [zbx+5*16], xmm3
movdqa [zbx+6*16], xmm4
movdqa [zbx+7*16], xmm5

Note that these changes only work with SSE4.1, so an if-else check on processor capability is required.
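For what it's worth, here's roughly the same thing as C intrinsics (untested sketch; the names state, init and out are just for illustration), with the SSE4.1 fallback handled at compile time:

Code:
#include <emmintrin.h>            /* SSE2: _mm_load_si128, _mm_add_epi32, _mm_store_si128 */
#ifdef __SSE4_1__
#include <smmintrin.h>            /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */
#endif

/* Adds three 16-byte lanes of init into state and writes them to out,
   mirroring the movdqa/paddd/movdqa block above. */
void add_lanes(__m128i *state, const __m128i *init, __m128i *out)
{
    for (int i = 5; i <= 7; ++i) {
#ifdef __SSE4_1__
        __m128i v = _mm_stream_load_si128(&state[i]);        /* movntdqa */
#else
        __m128i v = _mm_load_si128(&state[i]);                /* movdqa   */
#endif
        v = _mm_add_epi32(v, _mm_load_si128(&init[i]));       /* paddd    */
        _mm_store_si128(&out[i], v);                          /* movdqa   */
    }
}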
sr. member
Activity: 378
Merit: 250
LAB_NEXT_NONCE:
   mov      zsi, init
   
   mov      zbx, w
   mov      zax, pnonce              <--
   mov      eax, [zax]                <--
   mov      [zbx+3*4], eax

   mov      zcx, 64
   mov      zax, 18             <--
   mov      zdi, 3



I'm seeing a few hiccups here.  And, just to make sure, is this code (from, to) or (to, from)?  Different assemblers like it different ways.  If it's (to, from), then why not just:

LAB_NEXT_NONCE:
   mov      zsi, init
   
   mov      zbx, w
   mov      eax, pnonce
   mov      [zbx+3*4], eax

   mov      zcx, 64
   mov      zax, 18
   mov      zdi, 3

I'm finding a few of these in the 64-byte write range as well.  Granted, I didn't take into account how long it takes to complete an operation, but that shouldn't be an issue here.  But yeah, Intel likes its sets of 4 instructions sometimes.
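Thinking about it in C terms, I guess it comes down to whether pnonce is the nonce itself or a pointer to it; the [zax] dereference suggests a pointer. A hypothetical sketch (variable names only for illustration):

Code:
#include <stdint.h>

/* If pnonce holds the *address* of the nonce, the two-step form is required. */
void copy_nonce(uint32_t *pnonce, uint32_t w[16])
{
    w[3] = *pnonce;   /* mov zax, pnonce ; mov eax, [zax] ; mov [zbx+3*4], eax */

    /* A single "mov eax, pnonce" would instead copy (the low 32 bits of) the
       pointer value into w[3], not the nonce it points to. */
}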
sr. member
Activity: 404
Merit: 251
Only if you use 256-bit, which should be avoided until AVX2.  But you're only using XMM, not YMM, so none of that comes into play.  And sorry, I'm about to hit the wall.  I've been up all night.  Basically, if you're wanting to use the full YMM registers, you can't do much for integers as of yet.  It's coming.  However, you can use the lower half of the YMM by specifying the XMM equivalent.  The XMM registers can use your ADDs, Shifts, XORs and ANDs via AVX.  So,
Hm, the miner already uses XMM registers (SSE2) in the current implementation.

As for MOVNTDQA, it combines smaller transfers into a single stream via a buffer.  It's mainly useful for moving data into xmm by converting multiple smaller accesses into one 64-byte transfer, which increases the bandwidth approx. 10x.
It is not applicable to SHA256 calculation.

sr. member
Activity: 378
Merit: 250

I agree that MOVNTDQA is useful for streaming. But SHA256 calculation is not stream processing; it does not need to save anything to RAM.

But AVX does speed up the process for integers too.  Check out the reference card and search for the word integers.  There's also vectorized shuffling, etc.  So it's not just for floating-point data, but it is mainly geared toward it.  But yeah, it's at the 128-bit level only as of right now.  I can't wait for the AVX2 instruction set to come out.

SHA256 implementations require integer ADD, Shift, XOR, and AND.
Only XOR and AND can be done in AVX for the full 256-bit registers.
For the other instructions it is necessary to:
1. shuffle the high half of the YMM to the low half,
2. run the integer instruction,
3. shuffle the halves back.

That's only if you use 256-bit, which should be avoided until AVX2.  But you're only using XMM, not YMM, so none of that comes into play.  And sorry, I'm about to hit the wall.  I've been up all night.  Basically, if you're wanting to use the full YMM registers, you can't do much for integers as of yet.  It's coming.  However, you can use the lower half of the YMM by specifying the XMM equivalent.  The XMM registers can use your ADDs, Shifts, XORs and ANDs via AVX.  So, when AVX2 comes around, it'll be much simpler to modify the code to use the YMM registers to achieve the same things in 256-bit.
As for MOVNTDQA, it combines smaller transfers into a single stream via a buffer.  It's mainly useful for moving data into xmm by converting multiple smaller accesses into one 64-byte transfer, which increases the bandwidth approx. 10x.
The CPU cache is considered memory.  But yeah, I try not to memorize every detail of everything because I get confused easily.  But the idea is that it combines smaller loads into one large one by streaming them without contaminating the cache.
Here's an example taken from the Intel explanation, modified to better show how it pertains to the code.


Code:
; This load retrieves a full cache line and stores it in a temporary streaming load
; buffer.
; eax is a pointer to system-allocated memory of type USWC.
              MOVNTDQA xmm0, [eax+0]
; Subsequent 16-byte loads from the same cache line are supplied from the streaming
; load buffer and occur much faster (four 16-byte reads are served by one 64-byte line fetch).
              MOVNTDQA xmm1, [eax+16]
              MOVNTDQA xmm2, [eax+32]
              MOVNTDQA xmm3, [eax+48]
sr. member
Activity: 404
Merit: 251

I agree that MOVNTDQA is useful for streaming. But SHA256 calculation is not stream processing; it does not need to save anything to RAM.

But AVX does speed up the process for integers too.  Check out the reference card and search for the word integers.  There's also vectorized shuffling, etc.  So it's not just for floating-point data, but it is mainly geared toward it.  But yeah, it's at the 128-bit level only as of right now.  I can't wait for the AVX2 instruction set to come out.

SHA256 implementations require integer ADD, Shift, XOR, and AND.
Only XOR and AND can be done in AVX for the full 256-bit registers.
For the other instructions it is necessary to:
1. shuffle the high half of the YMM to the low half,
2. run the integer instruction,
3. shuffle the halves back.
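In intrinsics the workaround looks roughly like this (untested sketch; the helper name is made up, AVX only, no AVX2):

Code:
#include <immintrin.h>   /* AVX: _mm256_* cast/extract/insert; SSE2: _mm_add_epi32 */

/* 256-bit integer ADD emulated under AVX1: split the YMM into two XMM
   halves, add each half with PADDD, then put the halves back together.
   AVX2 replaces all of this with a single VPADDD (_mm256_add_epi32). */
static __m256i add_epi32_avx1(__m256i a, __m256i b)
{
    __m128i lo = _mm_add_epi32(_mm256_castsi256_si128(a),
                               _mm256_castsi256_si128(b));
    __m128i hi = _mm_add_epi32(_mm256_extractf128_si256(a, 1),
                               _mm256_extractf128_si256(b, 1));
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}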
sr. member
Activity: 378
Merit: 250
Okay, let me try one more time.  You can transfer data to the XMM registers for now.  Okay, I got ahead of myself with AVX2.  But there are integer-related commands that will speed up the code, like VPADDD, which takes two registers and adds them into a third.  But yeah, AVX2 mainly just allows the extension to the 256-bit registers instead of just the 128-bit.  But AVX does speed up the process for integers too.  Check out the reference card and search for the word integers.  There's also vectorized shuffling, etc.  So it's not just for floating-point data, but it is mainly geared toward it.  But yeah, it's at the 128-bit level only as of right now.  I can't wait for the AVX2 instruction set to come out.
One instruction in particular that I've already seen will help is VPAND.  And VPADDD will probably come in handy too.
Here's the data on the streaming loads via MOVNTDQA.  http://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-extensions-4-intel-sse4-streaming-load/
But yeah, there are vectorized versions of almost all commands up to 128-bit now.  And they're not just floating-point.  Unfortunately, they're probably easier to use through C than asm.  But, again, it would have to be something selected based on the processor, probably via intrinsics rather than hand-written asm.  I'm just saying, as long as everything's aligned, these vectorizations can probably remove a few extra commands, as long as the processors can handle them.
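For example (untested sketch, hypothetical function name): the same SSE2 integer intrinsics, when compiled with AVX enabled, come out as the three-operand VEX forms such as VPADDD and VPAND:

Code:
#include <emmintrin.h>   /* SSE2 integer intrinsics */

/* Built with plain SSE2 this compiles to PADDD/PAND (two-operand);
   built with AVX enabled (e.g. gcc -mavx) the compiler emits VPADDD/VPAND,
   the non-destructive three-operand VEX forms mentioned above. */
__m128i round_step(__m128i a, __m128i b, __m128i mask)
{
    __m128i sum = _mm_add_epi32(a, b);   /* (V)PADDD */
    return _mm_and_si128(sum, mask);     /* (V)PAND  */
}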
sr. member
Activity: 404
Merit: 251
The MOVNTDQA does cache, but it caches directly to L1 for immediate use while skipping over the other caches.  And
I don't see any performance improvement here. In both cases I have L1-cached data.

But you can try it: just patch the .ASM file and build it on Linux.

AVX does support integers; it just allows them to be vectorized into 256-bits so you can perform the same calculation
This doc says:
Extensibility: Intel AVX has powerful built-in extensibility options for the future without resorting to code growth:
OS context management rework only needs to be done once.
Future Vector Integer support to 256 and 512 bits


Sometime in the future, AVX will support vector integers.
sr. member
Activity: 378
Merit: 250
upon processor capabilities, you could add quite a few optimizations.  For one, streaming moves by using movntdqa to avoid some of the lower caches (if available)
Without caching it will obviously be slower.

, using YMM registers which are just combining two XMM registers making the code compatible (if available), use of AVX extensions to combine a few functions (if available), etc.  Heck, anymore, CPUs are capable of 256-bit computing. 
AVX doesn't support integer operations; it has a floating-point-only ALU.
The MOVNTDQA does cache, but it caches directly to L1 for immediate use while skipping over the other caches.  And AVX does support integers; it just allows them to be vectorized into 256-bits so you can perform the same calculation on 4 integers at once.  http://software.intel.com/en-us/articles/intel-avx-new-frontiers-in-performance-improvements-and-energy-efficiency/
It does, however, say that floating point will benefit the most, but the code can be vectorized for AVX-capable processors.
sr. member
Activity: 404
Merit: 251
upon processor capabilities, you could add quite a few optimizations.  For one, streaming moves by using movntdqa to avoid some of the lower caches (if available)
Without caching it will obviously be slower.

, using YMM registers which are just combining two XMM registers making the code compatible (if available), use of AVX extensions to combine a few functions (if available), etc.  Heck, anymore, CPUs are capable of 256-bit computing. 
AVX doesn't support integer operations; it has a floating-point-only ALU.
sr. member
Activity: 378
Merit: 250
Well, it seems to work.  However, the 64-bit version appears slower than the 32-bit on Windows 7.
Also, I was going through some of the source code for the assembly--you know, if you added some if statements based upon processor capabilities, you could add quite a few optimizations.  For one, streaming moves using movntdqa to avoid some of the lower caches (if available), YMM registers, which just combine two XMM registers, keeping the code compatible (if available), use of AVX extensions to combine a few functions (if available), etc.  Heck, these days CPUs are capable of 256-bit computing, so you could effectively double the output.
Granted, this is based on the Linux source code.  But yeah, standard SSE2 is compatible everywhere; a little spice could be nice, though.
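Something like this is what I mean (just a sketch with made-up function names, using GCC's __builtin_cpu_supports; the real miner's entry points will differ):

Code:
#include <stdint.h>

/* Stubs standing in for the real SHA256 kernels. */
static void sha256_sse2(uint32_t *s, const uint32_t *b)  { (void)s; (void)b; /* SSE2 baseline path         */ }
static void sha256_sse41(uint32_t *s, const uint32_t *b) { (void)s; (void)b; /* SSE4.1 path (movntdqa ...) */ }
static void sha256_avx(uint32_t *s, const uint32_t *b)   { (void)s; (void)b; /* AVX path (vpaddd, vpand)   */ }

typedef void (*sha256_fn)(uint32_t *, const uint32_t *);

/* Pick the best routine once at startup, based on what the CPU reports. */
sha256_fn select_sha256(void)
{
    if (__builtin_cpu_supports("avx"))
        return sha256_avx;
    if (__builtin_cpu_supports("sse4.1"))
        return sha256_sse41;
    return sha256_sse2;   /* SSE2 is always there on x86-64 */
}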
sr. member
Activity: 404
Merit: 251
You must be using the 64-bit version then.  It seems to have a problem with the block updates.  Once the block changes, it's not keeping up.

Oh, yeah I am. I'll try the 32-bit version for a while. Thanks.

This x64 CPU-mining bug is fixed in version 0.27.
newbie
Activity: 24
Merit: 0
Actually I have a complaint. Your miner will work for about 20 minutes correctly, after which every share is rejected. I am using slush's pool which is accepting most of my GPU work, but your miner accepts no more than 3 shares before every share is rejected.
You must be using the 64-bit version then.  It seems to have a problem with the block updates.  Once the block changes, it's not keeping up.
Oh, yeah I am. I'll try the 32-bit version for a while. Thanks.
sr. member
Activity: 378
Merit: 250
Actually I have a complaint. Your miner will work for about 20 minutes correctly, after which every share is rejected. I am using slush's pool which is accepting most of my GPU work, but your miner accepts no more than 3 shares before every share is rejected.
You must be using the 64-bit version then.  It seems to have a problem with the block updates.  Once the block changes, it's not keeping up.
newbie
Activity: 24
Merit: 0
Actually I have a complaint. Your miner will work for about 20 minutes correctly, after which every share is rejected. I am using slush's pool which is accepting most of my GPU work, but your miner accepts no more than 3 shares before every share is rejected.
hero member
Activity: 590
Merit: 500
Getting "Found NONCE not accepted by Target" messages on latest version of Ufasoft's miner when connecting to latest win32 version of p2pool.

I'm getting much the same using the Jan 28 update of the 64-bit client.  Massive amounts of invalids (95 out of 180) on btcguild; not sure what pool software they use, though.  Using an Nvidia 560M GTX and a Core i7 QM2670.

Going to try the 32-bit client and see if matters improve, and whether it isn't actually my hardware.

UPDATE: ran the 32-bit client all night.  Zero invalids out of 276 shares.  Something is out of whack with the 64-bit client.
newbie
Activity: 24
Merit: 0
Over 300% improvement over my old CPU miner, thanks a lot! But it's not as good as DiabloMiner for my GPU, and it hurts desktop interactivity, so I will be using it just for the CPU.
hero member
Activity: 981
Merit: 500
DIV - Your "Virtual Life" Secured and Decentralize
-a 5 got it as low as 50 good, 11 stale. Tempted to go back to 32-bit, but 64 is a bit faster. I think I will try 32 and see if the problem persists.
hero member
Activity: 981
Merit: 500
DIV - Your "Virtual Life" Secured and Decentralize
Changing -a 30 to -a 20 has improved my score from 9 good / 9 stale to 14 good / 9 stale; it almost looks like bitclockers changed something. I did not double-check my results with 32; it could have been a pool issue, sorry. I was hoping to limit bandwidth from my client to be nicer to their server, as my hashes are slow.
hero member
Activity: 981
Merit: 500
DIV - Your "Virtual Life" Secured and Decentralize
I have noticed an issue. Possibly this isn't a common occurrence, but on my setup I show a 50% stale rate on bitclockers with LP.  Should changing to 64-bit take me from 17 good / 1 stale to 9 good / 9 stale?  Here are the switches I used, the same for both 64-bit and 32-bit:
-a 30 -g yes -T 75 -t 4. The GPU usually sits at 74C; the hashrate tops out at 5.6 on 32-bit and 5.9 on 64-bit. I see nothing that should cause more than an hour between submissions on a 64-bit run that wouldn't on a 32-bit run. My hardware is really slow, but it was really slow on 32-bit too.
Despite this oddity I still like your miner the best.
Please keep making the best miner for me!