
Topic: Ufasoft Miner - Windows/Linux, x86/x64, SSE2/OpenCL, Open Source - page 20. (Read 631037 times)

sr. member
Activity: 404
Merit: 251
So, you're telling me that we're not moving 16-byte chunks into the xmm registers from 64-byte cache lines?  Because I see a few places here where we are.  Like here:

Code:
ELSE
movdqa xmm3, [zsp+5*16]
movdqa xmm4, [zsp+6*16]
movdqa xmm5, [zsp+7*16]

paddd xmm3, [zsi+5*16]
paddd xmm4, [zsi+6*16]
paddd xmm5, [zsi+7*16]

movdqa [zbx+5*16], xmm3
movdqa [zbx+6*16], xmm4
movdqa [zbx+7*16], xmm5

These data are already in the L1 cache, so it makes no difference which MOV instruction loads them.

Anyway, if you are a practical programmer, you can profile the patched miner. Anything that is not tested is just speculation.
sr. member
Activity: 378
Merit: 250
Only if you use 256-bit, which should be avoided until AVX2.  But you're only using XMM, not YMM, so none of that comes into play.  And sorry, I'm about to hit the wall.  I've been up all night.  Basically, if you're wanting to use the full YMM registers, you can't do much for integers as of yet.  It's coming.  However, you can use the lower half of the YMM by specifying the XMM equivalent.  The XMM registers can use your ADDs, Shifts, XORs and ANDs via AVX.  So,
Hm, the miner already uses XMM registers (SSE2) in the current implementation.

As for MOVNTDQA, it combines smaller transfers into a single stream via a buffer.  It's mainly useful for moving data into xmm by converting multiple smaller accesses into one 64-byte transfer, which increases the bandwidth approx. 10x.
It is not applicable to SHA256 calculation.


So, you're telling me that we're not moving 16-byte chunks into the xmm registers from 64-byte cache lines?  Because I see a few places here where we are.  Like here:


Code:
ELSE
movdqa xmm3, [zsp+5*16]
movdqa xmm4, [zsp+6*16]
movdqa xmm5, [zsp+7*16]

paddd xmm3, [zsi+5*16]
paddd xmm4, [zsi+6*16]
paddd xmm5, [zsi+7*16]

movdqa [zbx+5*16], xmm3
movdqa [zbx+6*16], xmm4
movdqa [zbx+7*16], xmm5
Changed to:

Code:
ELSE
movntdqa xmm3, [zsp+5*16]  ; First load pulls the whole 64-byte cache line into the streaming load buffer.
movntdqa xmm4, [zsp+6*16]  ; If zsp is 64-byte aligned, this hits the same line and is served from the buffer.
movntdqa xmm5, [zsp+7*16]  ; Same here, so no extra trip to memory is needed.

paddd xmm3, [zsi+5*16]
paddd xmm4, [zsi+6*16]
paddd xmm5, [zsi+7*16]

movdqa [zbx+5*16], xmm3
movdqa [zbx+6*16], xmm4
movdqa [zbx+7*16], xmm5

Note that these changes only work with SSE4.1, so an if-else check on processor capability is required.
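For what it's worth, here's roughly the same thing as C intrinsics (untested sketch; the names state, init and out are just for illustration), with the SSE4.1 fallback handled at compile time:

Code:
#include <emmintrin.h>            /* SSE2: _mm_load_si128, _mm_add_epi32, _mm_store_si128 */
#ifdef __SSE4_1__
#include <smmintrin.h>            /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */
#endif

/* Adds three 16-byte lanes of init into state and writes them to out,
   mirroring the movdqa/paddd/movdqa block above. */
void add_lanes(__m128i *state, const __m128i *init, __m128i *out)
{
    for (int i = 5; i <= 7; ++i) {
#ifdef __SSE4_1__
        __m128i v = _mm_stream_load_si128(&state[i]);        /* movntdqa */
#else
        __m128i v = _mm_load_si128(&state[i]);                /* movdqa   */
#endif
        v = _mm_add_epi32(v, _mm_load_si128(&init[i]));       /* paddd    */
        _mm_store_si128(&out[i], v);                          /* movdqa   */
    }
}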
sr. member
Activity: 378
Merit: 250
LAB_NEXT_NONCE:
   mov      zsi, init
   
   mov      zbx, w
   mov      zax, pnonce              <--
   mov      eax, [zax]                <--
   mov      [zbx+3*4], eax

   mov      zcx, 64
   mov      zax, 18             <--
   mov      zdi, 3



I'm seeing a few hiccups here.  And, just to make sure, is this code (from, to) or (to, from)?  Different assemblers like it different ways.  If it's (to, from), then why not just:

LAB_NEXT_NONCE:
   mov      zsi, init
   
   mov      zbx, w
   mov      eax, pnonce
   mov      [zbx+3*4], eax

   mov      zcx, 64
   mov      zax, 18
   mov      zdi, 3

I'm finding a few of these in the 64-byte write range as well.  Granted, I didn't take into account how long it takes to complete an operation, but that shouldn't be an issue here.  But yeah, Intel likes its sets of 4 instructions sometimes.
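Thinking about it in C terms, I guess it comes down to whether pnonce is the nonce itself or a pointer to it; the [zax] dereference suggests a pointer. A hypothetical sketch (variable names only for illustration):

Code:
#include <stdint.h>

/* If pnonce holds the *address* of the nonce, the two-step form is required. */
void copy_nonce(uint32_t *pnonce, uint32_t w[16])
{
    w[3] = *pnonce;   /* mov zax, pnonce ; mov eax, [zax] ; mov [zbx+3*4], eax */

    /* A single "mov eax, pnonce" would instead copy (the low 32 bits of) the
       pointer value into w[3], not the nonce it points to. */
}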
sr. member
Activity: 404
Merit: 251
Only if you use 256-bit, which should be avoided until AVX2.  But you're only using XMM, not YMM, so none of that comes into play.  And sorry, I'm about to hit the wall.  I've been up all night.  Basically, if you're wanting to use the full YMM registers, you can't do much for integers as of yet.  It's coming.  However, you can use the lower half of the YMM by specifying the XMM equivalent.  The XMM registers can use your ADDs, Shifts, XORs and ANDs via AVX.  So,
Hm, the miner already uses XMM registers (SSE2) in the current implementation.

As for MOVNTDQA, it combines smaller transfers into a single stream via a buffer.  It's mainly useful for moving data into xmm by converting multiple smaller accesses into one 64-byte transfer, which increases the bandwidth approx. 10x.
It is not applicable to SHA256 calculation.

sr. member
Activity: 378
Merit: 250

I agree that MOVNTDQA is useful for streaming. But SHA256 calculation is not stream processing; it does not need to save anything to RAM.

But AVX does speed up the process for integers too.  Check out the reference card and search for the word integers.  There's also vectorized shuffling, etc.  So it's not just for floating-point data, but it is mainly geared toward it.  But yeah, it's at the 128-bit level only as of right now.  I can't wait for the AVX2 instruction set to come out.

SHA256 implementations require integer ADD, Shift, XOR, and AND.
Only XOR and AND can be done in AVX for the full 256-bit registers.
For the other instructions it is necessary to:
1. shuffle the high half of the YMM to the low half,
2. run the integer instruction,
3. shuffle the halves back.

That's only if you use 256-bit, which should be avoided until AVX2.  But you're only using XMM, not YMM, so none of that comes into play.  And sorry, I'm about to hit the wall.  I've been up all night.  Basically, if you're wanting to use the full YMM registers, you can't do much for integers as of yet.  It's coming.  However, you can use the lower half of the YMM by specifying the XMM equivalent.  The XMM registers can use your ADDs, Shifts, XORs and ANDs via AVX.  So, when AVX2 comes around, it'll be much simpler to modify the code to use the YMM registers to achieve the same things in 256-bit.
As for MOVNTDQA, it combines smaller transfers into a single stream via a buffer.  It's mainly useful for moving data into xmm by converting multiple smaller accesses into one 64-byte transfer, which increases the bandwidth approx. 10x.
The CPU cache is considered memory.  But yeah, I try not to memorize every detail of everything because I get confused easily.  But the idea is that it combines smaller loads into one large one by streaming them without contaminating the cache.
Here's an example taken from the Intel explanation, modified to better show how it pertains to the code.


Code:
; This load retrieves a full cache line and stores it in a temporary streaming load
; buffer.
; eax is a pointer to system-allocated memory of type USWC.
              MOVNTDQA xmm0, [eax+0]
; Subsequent 16-byte loads from the same cache line are supplied from the streaming
; load buffer and occur much faster (four 16-byte reads are served by one 64-byte line fetch).
              MOVNTDQA xmm1, [eax+16]
              MOVNTDQA xmm2, [eax+32]
              MOVNTDQA xmm3, [eax+48]
sr. member
Activity: 404
Merit: 251

I agree that MOVNTDQA is useful for streaming. But SHA256 calculation is not stream processing; it does not need to save anything to RAM.

But AVX does speed up the process for integers too.  Check out the reference card and search for the word integers.  There's also vectorized shuffling, etc.  So it's not just for floating-point data, but it is mainly geared toward it.  But yeah, it's at the 128-bit level only as of right now.  I can't wait for the AVX2 instruction set to come out.

SHA256 implementations require integer ADD, Shift, XOR, and AND.
Only XOR and AND can be done in AVX for the full 256-bit registers.
For the other instructions it is necessary to:
1. shuffle the high half of the YMM to the low half,
2. run the integer instruction,
3. shuffle the halves back.
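In intrinsics the workaround looks roughly like this (untested sketch; the helper name is made up, AVX only, no AVX2):

Code:
#include <immintrin.h>   /* AVX: _mm256_* cast/extract/insert; SSE2: _mm_add_epi32 */

/* 256-bit integer ADD emulated under AVX1: split the YMM into two XMM
   halves, add each half with PADDD, then put the halves back together.
   AVX2 replaces all of this with a single VPADDD (_mm256_add_epi32). */
static __m256i add_epi32_avx1(__m256i a, __m256i b)
{
    __m128i lo = _mm_add_epi32(_mm256_castsi256_si128(a),
                               _mm256_castsi256_si128(b));
    __m128i hi = _mm_add_epi32(_mm256_extractf128_si256(a, 1),
                               _mm256_extractf128_si256(b, 1));
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}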
sr. member
Activity: 378
Merit: 250
Okay, let me try one more time.  You can transfer data to the XMM registers for now.  Okay, I got ahead of myself with AVX2.  But there are integer-related commands that will speed up the code, like VPADDD, which takes two registers and adds them into a third.  But yeah, AVX2 mainly just allows the extension to the 256-bit registers instead of just the 128-bit.  But AVX does speed up the process for integers too.  Check out the reference card and search for the word integers.  There's also vectorized shuffling, etc.  So it's not just for floating-point data, but it is mainly geared toward it.  But yeah, it's at the 128-bit level only as of right now.  I can't wait for the AVX2 instruction set to come out.
One instruction in particular that I've already seen will help is VPAND.  And VPADDD will probably come in handy too.
Here's the data on the streaming loads via MOVNTDQA.  http://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-extensions-4-intel-sse4-streaming-load/
But yeah, there are vectorized versions of almost all commands up to 128-bit now.  And they're not just floating-point.  Unfortunately, they're probably easier to use through C than asm.  But, again, it would have to be something selected based on the processor, probably via intrinsics rather than hand-written asm.  I'm just saying, as long as everything's aligned, these vectorizations can probably remove a few extra commands, as long as the processors can handle them.
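For example (untested sketch, hypothetical function name): the same SSE2 integer intrinsics, when compiled with AVX enabled, come out as the three-operand VEX forms such as VPADDD and VPAND:

Code:
#include <emmintrin.h>   /* SSE2 integer intrinsics */

/* Built with plain SSE2 this compiles to PADDD/PAND (two-operand);
   built with AVX enabled (e.g. gcc -mavx) the compiler emits VPADDD/VPAND,
   the non-destructive three-operand VEX forms mentioned above. */
__m128i round_step(__m128i a, __m128i b, __m128i mask)
{
    __m128i sum = _mm_add_epi32(a, b);   /* (V)PADDD */
    return _mm_and_si128(sum, mask);     /* (V)PAND  */
}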
sr. member
Activity: 404
Merit: 251
The MOVNTDQA does cache, but it caches directly to L1 for immediate use while skipping over the other caches.  And
I don't see any performance improvement here. In both cases I have L1-cached data.

But you can try it: just patch the .ASM file and build it on Linux.

AVX does support integers; it just allows them to be vectorized into 256-bits so you can perform the same calculation
This doc says:
Extensibility: Intel AVX has powerful built-in extensibility options for the future without resorting to code growth:
OS context management rework only needs to be done once.
Future Vector Integer support to 256 and 512 bits


Sometime in the future, AVX will support vector integers.
sr. member
Activity: 378
Merit: 250
upon processor capabilities, you could add quite a few optimizations.  For one, streaming moves by using movntdqa to avoid some of the lower caches (if available)
Without caching it will obviously be slower.

, using YMM registers which are just combining two XMM registers making the code compatible (if available), use of AVX extensions to combine a few functions (if available), etc.  Heck, anymore, CPUs are capable of 256-bit computing. 
AVX doesn't support integer operations; it has a floating-point-only ALU.
The MOVNTDQA does cache, but it caches directly to L1 for immediate use while skipping over the other caches.  And AVX does support integers; it just allows them to be vectorized into 256-bits so you can perform the same calculation on 4 integers at once.  http://software.intel.com/en-us/articles/intel-avx-new-frontiers-in-performance-improvements-and-energy-efficiency/
It does, however, say that floating point will benefit the most, but the code can be vectorized for AVX-capable processors.
sr. member
Activity: 404
Merit: 251
upon processor capabilities, you could add quite a few optimizations.  For one, streaming moves by using movntdqa to avoid some of the lower caches (if available)
Without caching it will obviously be slower.

, using YMM registers which are just combining two XMM registers making the code compatible (if available), use of AVX extensions to combine a few functions (if available), etc.  Heck, anymore, CPUs are capable of 256-bit computing. 
AVX doesn't support integer operations; it has a floating-point-only ALU.
sr. member
Activity: 378
Merit: 250
Well, it seems to work.  However, the 64-bit version appears slower than the 32-bit on Windows 7.
Also, I was going through some of the source code for the assembly--you know, if you added some if statements based upon processor capabilities, you could add quite a few optimizations.  For one, streaming moves using movntdqa to avoid some of the lower caches (if available), YMM registers, which just combine two XMM registers, keeping the code compatible (if available), use of AVX extensions to combine a few functions (if available), etc.  Heck, these days CPUs are capable of 256-bit computing, so you could effectively double the output.
Granted, this is based on the Linux source code.  But yeah, standard SSE2 is compatible everywhere; a little spice could be nice, though.
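Something like this is what I mean (just a sketch with made-up function names, using GCC's __builtin_cpu_supports; the real miner's entry points will differ):

Code:
#include <stdint.h>

/* Stubs standing in for the real SHA256 kernels. */
static void sha256_sse2(uint32_t *s, const uint32_t *b)  { (void)s; (void)b; /* SSE2 baseline path         */ }
static void sha256_sse41(uint32_t *s, const uint32_t *b) { (void)s; (void)b; /* SSE4.1 path (movntdqa ...) */ }
static void sha256_avx(uint32_t *s, const uint32_t *b)   { (void)s; (void)b; /* AVX path (vpaddd, vpand)   */ }

typedef void (*sha256_fn)(uint32_t *, const uint32_t *);

/* Pick the best routine once at startup, based on what the CPU reports. */
sha256_fn select_sha256(void)
{
    if (__builtin_cpu_supports("avx"))
        return sha256_avx;
    if (__builtin_cpu_supports("sse4.1"))
        return sha256_sse41;
    return sha256_sse2;   /* SSE2 is always there on x86-64 */
}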
sr. member
Activity: 404
Merit: 251
You must be using the 64-bit version then.  It seems to have a problem with the block updates.  Once the block changes, it's not keeping up.

Oh, yeah I am. I'll try the 32-bit version for a while. Thanks.

This x64 CPU-mining bug is fixed in version 0.27.
newbie
Activity: 24
Merit: 0
Actually I have a complaint. Your miner will work for about 20 minutes correctly, after which every share is rejected. I am using slush's pool which is accepting most of my GPU work, but your miner accepts no more than 3 shares before every share is rejected.
You must be using the 64-bit version then.  It seems to have a problem with the block updates.  Once the block changes, it's not keeping up.
Oh, yeah I am. I'll try the 32-bit version for a while. Thanks.
sr. member
Activity: 378
Merit: 250
Actually I have a complaint. Your miner will work for about 20 minutes correctly, after which every share is rejected. I am using slush's pool which is accepting most of my GPU work, but your miner accepts no more than 3 shares before every share is rejected.
You must be using the 64-bit version then.  It seems to have a problem with the block updates.  Once the block changes, it's not keeping up.
newbie
Activity: 24
Merit: 0
Actually I have a complaint. Your miner will work for about 20 minutes correctly, after which every share is rejected. I am using slush's pool which is accepting most of my GPU work, but your miner accepts no more than 3 shares before every share is rejected.
hero member
Activity: 590
Merit: 500
Getting "Found NONCE not accepted by Target" messages on latest version of Ufasoft's miner when connecting to latest win32 version of p2pool.

I'm getting much the same using the Jan 28 update of the 64-bit client.  Massive amounts of invalids (95 out of 180) on btcguild; not sure what pool software they use, though.  Using an Nvidia 560M GTX and a Core i7 QM2670.

Going to try the 32-bit client and see if matters improve, and whether it isn't actually my hardware.

UPDATE: ran the 32-bit client all night.  Zero invalids out of 276 shares.  Something is out of whack with the 64-bit client.
newbie
Activity: 24
Merit: 0
Over 300% improvement over my old CPU miner, thanks a lot! But it's not as good as DiabloMiner for my GPU, and it hurts desktop interactivity, so I will be using it just for the CPU.
hero member
Activity: 981
Merit: 500
DIV - Your "Virtual Life" Secured and Decentralize
-a 5 got it as low as 50 good, 11 stale. Tempted to go back to 32-bit, but 64 is a bit faster. I think I will try 32 and see if the problem persists.
hero member
Activity: 981
Merit: 500
DIV - Your "Virtual Life" Secured and Decentralize
Changing -a 30 to -a 20 has improved my score from 9 good / 9 stale to 14 good / 9 stale; it almost looks like bitclockers changed something. I did not double-check my results with 32; it could have been a pool issue, sorry. I was hoping to limit bandwidth from my client to be nicer to their server, as my hashes are slow.
hero member
Activity: 981
Merit: 500
DIV - Your "Virtual Life" Secured and Decentralize
I have noticed an issue. Possibly this isn't a common occurrence, but on my setup I show a 50% stale rate on bitclockers with LP.  Should changing to 64-bit take me from 17 good / 1 stale to 9 good / 9 stale?  Here are the switches I used, the same for both 64-bit and 32-bit:
-a 30 -g yes -T 75 -t 4. The GPU usually sits at 74C; the hashrate tops out at 5.6 on 32-bit and 5.9 on 64-bit. I see nothing that should cause more than an hour between submissions on a 64-bit run that wouldn't on a 32-bit run. My hardware is really slow, but it was really slow on 32-bit too.
Despite this oddity I still like your miner the best.
Please keep making the best miner for me!