Topic: OFFICIAL CGMINER mining software thread for linux/win/osx/mips/arm/r-pi 4.11.0 - page 817. (Read 5805740 times)

sr. member
Activity: 378
Merit: 250
d3m0n1q_733rz, why wouldn't you create a fork on github? It would be easier than copy-pasting and less error-prone.
A) I can't actually program from scratch, and most of my changes are just logic-based.
B) Almost nobody gives me input on how my changes affect their hashing anyway.
C) People might end up sending me incessant requests for changes that I couldn't keep up with.

Sorry to say but that version dropped performance by about 3-4%. There is definitely some variance in the cgminer speed reporting from moment to moment (from 6.7 to 7.4 Mh/s total for this 2 core setup) but over a ~2-3min period it seems to average out to reliable numbers.

Since I'm providing some feedback I should prolly tell you some hardware & compile details:
Code:
CFLAGS = -O3 -ffast-math -funroll-loops -mtune=native -march=native -msahf

vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU            5160  @ 3.00GHz
stepping        : 11
cpu MHz         : 2992.227
cache size      : 4096 KB
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon
pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm
dca lahf_lm tpr_shadow vnmi flexpriority
Yeah, the drop in performance is caused by a little taboo I ran into in the long hours of the night.  I forgot the 3-1 rule in relation to clock cycles.  In other words, I streamed together too many "expensive" commands without giving them time to complete before the next set.  I traded one optimization for another that didn't work as well.
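To make the idea concrete, here's a minimal, hypothetical NASM-style sketch (not taken from the miner source) of what that rule is about: back-to-back dependent SSE ops stall because each one has to wait for the previous result, while interleaving two independent chains, the way the W expansion alternates xmm0 and xmm4, gives the pipeline work to do while results are still in flight.
Code:
; Dependent chain -- each instruction waits on the one before it:
psrld xmm0, 3        ; xmm0 = W >> 3
pxor  xmm0, xmm1     ; has to wait for the psrld result
pxor  xmm0, xmm2     ; has to wait again

; Two independent chains interleaved -- the xmm4 chain issues while
; the xmm0 results are still being computed:
psrld xmm0, 3
psrld xmm4, 3
pxor  xmm0, xmm1
pxor  xmm4, xmm5
pxor  xmm0, xmm2
pxor  xmm4, xmm6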
newbie
Activity: 11
Merit: 0
d3m0n1q_733rz, why wouldn't you create a fork on github? It would be easier than copy-pasting and less error-prone.
A) I can't actually program from scratch, and most of my changes are just logic-based.
B) Almost nobody gives me input on how my changes affect their hashing anyway.
C) People might end up sending me incessant requests for changes that I couldn't keep up with.

Sorry to say but that version dropped performance by about 3-4%. There is definitely some variance in the cgminer speed reporting from moment to moment (from 6.7 to 7.4 Mh/s total for this 2 core setup) but over a ~2-3min period it seems to average out to reliable numbers.

Since I'm providing some feedback I should prolly tell you some hardware & compile details:
Code:
CFLAGS = -O3 -ffast-math -funroll-loops -mtune=native -march=native -msahf

vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU            5160  @ 3.00GHz
stepping        : 11
cpu MHz         : 2992.227
cache size      : 4096 KB
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon
pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm
dca lahf_lm tpr_shadow vnmi flexpriority
newbie
Activity: 59
Merit: 0
The point is that with a fork on github I can see exactly what you have changed compared to the original files, and also easily download your latest changes without tedious copy-pasting from the forum. You can also easily pull ck's changes into your branch (a single command), and when it's ready you can just give him the changed files.

A fork on github doesn't mean that you are taking the code away and starting your own project, it's just a way to easily publish your changes for others to test and submit to the original project when ready.
sr. member
Activity: 378
Merit: 250
d3m0n1q_733rz, why wouldn't you create a fork on github? It would be easier than copy-pasting and less error-prone.
A) I can't actually program from scratch, and most of my changes are just logic-based.
B) Almost nobody gives me input on how my changes affect their hashing anyway.
C) People might end up sending me incessant requests for changes that I couldn't keep up with.

Besides, this is more of a hobby for me than an outright project and I want to be able to drop it like one without people getting caught in the wake.   Wink

In related news:
Code:
;; SHA-256 for X86-64 for Linux, based off of:

; (c) Ufasoft 2011 http://ufasoft.com mailto:[email protected]
; Version 2011
; This software is Public Domain

; Significant re-write/optimisation and reordering by,
; Neil Kettle
; Small modifications played around with by,
; Erick Couts II
; ~18% performance improvement

; SHA-256 CPU SSE cruncher for Bitcoin Miner

ALIGN 32
BITS 64

%define hash rdi
%define data rsi
%define init rdx

; 0 = (1024 - 256) (mod (LAB_CALC_UNROLL*LAB_CALC_PARA*16))
%define LAB_CALC_PARA 2
%define LAB_CALC_UNROLL 8

%define LAB_LOOP_UNROLL 8

extern g_4sha256_k

global CalcSha256_x64_sse4
; CalcSha256 hash(rdi), data(rsi), init(rdx)
CalcSha256_x64_sse4:

push rbx

LAB_NEXT_NONCE:

mov rcx, 256 ; 256 - rcx is # of SHA-2 rounds
; mov rax, 64 ; 64 - rax is where we expand to

LAB_SHA:
push rcx
lea rcx, qword [data+1024] ; + 1024
lea r11, qword [data+256] ; + 256

LAB_CALC:
%macro lab_calc_blk 1
; prefetcht0 [r11-(15-%1)*16]
; prefetcht0 [r11-(15-(%1+1))*16]

movntdqa xmm0, [r11-(15-%1)*16] ; xmm0 = W[I-15]
movdqa xmm1, xmm0 ; xmm1 = W[I-15]
movdqa xmm2, xmm0 ; xmm2 = W[I-15]
movntdqa xmm3, [r11-(2-%1)*16] ; xmm3 = W[I-2]
movntdqa xmm4, [r11-(15-(%1+1))*16] ; xmm4 = W[I-15+1]
movdqa xmm5, xmm4 ; xmm5 = W[I-15+1]
movdqa xmm6, xmm4 ; xmm6 = W[I-15+1]
movntdqa xmm7, [r11-(2-(%1+1))*16] ; xmm7 = W[I-2+1]

; movdqa xmm2, xmm0 ; xmm2 = W[I-15]
; movdqa xmm6, xmm4 ; xmm6 = W[I-15+1]

psrld xmm0, 3 ; xmm0 = W[I-15] >> 3
psrld xmm1, 7 ; xmm1 = W[I-15] >> 7 (Moved and made it independent of xmm0)
psrld xmm4, 3 ; xmm4 = W[I-15+1] >> 3
psrld xmm5, 7 ; xmm5 = W[I-15+1] >> 7
pslld xmm2, 14 ; xmm2 = W[I-15] << 14

; movdqa xmm5, xmm4 ; xmm5 = W[I-15+1] >> 3
; movdqa xmm1, xmm0 ; xmm1 = W[I-15] >> 3

pxor xmm4, xmm5 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7)
pslld xmm6, 14 ; xmm6 = W[I-15+1] << 14

pxor xmm0, xmm1 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7)
pxor xmm0, xmm2 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14)
psrld xmm1, 11 ; xmm1 = W[I-15] >> 18
psrld xmm5, 11 ; xmm5 = W[I-15+1] >> 18
pxor xmm4, xmm6 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14)
pxor xmm4, xmm5 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18)
pslld xmm2, 11 ; xmm2 = W[I-15] << 25
pslld xmm6, 11 ; xmm6 = W[I-15+1] << 25
pxor xmm4, xmm6 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18) ^ (W[I-15+1] << 25)
pxor xmm0, xmm1 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18)
pxor xmm0, xmm2 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18) ^ (W[I-15] << 25)
paddd xmm0, [r11-(16-%1)*16] ; xmm0 = s0(W[I-15]) + W[I-16]
paddd xmm4, [r11-(16-(%1+1))*16] ; xmm4 = s0(W[I-15+1]) + W[I-16+1]


;;;;;;;;;;;;;;;;;;

movdqa xmm2, xmm3 ; xmm2 = W[I-2]
movdqa xmm1, xmm3 ; xmm1 = W[I-2] >> 10
movdqa xmm6, xmm7 ; xmm6 = W[I-2+1]
movdqa xmm5, xmm7 ; xmm5 = W[I-2+1] >> 10


paddd xmm0, [r11-(7-%1)*16] ; xmm0 = s0(W[I-15]) + W[I-16] + W[I-7]
paddd xmm4, [r11-(7-(%1+1))*16] ; xmm4 = s0(W[I-15+1]) + W[I-16+1] + W[I-7+1]

psrld xmm1, 17 ; xmm1 = W[I-2] >> 17
pslld xmm2, 13 ; xmm2 = W[I-2] << 13
psrld xmm3, 10 ; xmm3 = W[I-2] >> 10
psrld xmm5, 17 ; xmm5 = W[I-2+1] >> 17
pslld xmm6, 13 ; xmm6 = W[I-2+1] << 13
psrld xmm7, 10 ; xmm7 = W[I-2+1] >> 10

pxor xmm3, xmm1 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17)
pxor xmm3, xmm2 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13)
pxor xmm7, xmm5 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17)
pxor xmm7, xmm6 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13)

psrld xmm1, 2 ; xmm1 = W[I-2] >> 19
psrld xmm5, 2 ; xmm5 = W[I-2+1] >> 19
pslld xmm2, 2 ; xmm2 = W[I-2] << 15
pslld xmm6, 2 ; xmm6 = W[I-2+1] << 15

pxor xmm3, xmm1 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19)
pxor xmm3, xmm2 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19) ^ (W[I-2] << 15)
pxor xmm7, xmm5 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19)
pxor xmm7, xmm6 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19) ^ (W[I-2+1] << 15)
paddd xmm0, xmm3 ; xmm0 = s0(W[I-15]) + W[I-16] + s1(W[I-2]) + W[I-7]
paddd xmm4, xmm7 ; xmm4 = s0(W[I-15+1]) + W[I-16+1] + s1(W[I-2+1]) + W[I-7+1]

movdqa [r11+((%1+1)*16)], xmm4
movdqa [r11+(%1*16)], xmm0
%endmacro

%assign i 0
%rep    LAB_CALC_UNROLL
        lab_calc_blk i
%assign i i+LAB_CALC_PARA
%endrep
; prefetchnta [rcx]

add r11, LAB_CALC_UNROLL*LAB_CALC_PARA*16
cmp r11, rcx
jb LAB_CALC
prefetchnta [init+16]
pop rcx
mov rax, 0

; Load the init values of the message into the hash.

movntdqa xmm7, [init]
movntdqa xmm0, [init+16]
pshufd xmm5, xmm7, 0x55 ; xmm5 == b
pshufd xmm4, xmm7, 0xAA ; xmm4 == c
pshufd xmm3, xmm7, 0xFF ; xmm3 == d
pshufd xmm7, xmm7, 0 ; xmm7 == a
pshufd xmm8, xmm0, 0x55 ; xmm8 == f
pshufd xmm9, xmm0, 0xAA ; xmm9 == g
pshufd xmm10, xmm0, 0xFF ; xmm10 == h
pshufd xmm0, xmm0, 0 ; xmm0 == e

LAB_LOOP:

;; T t1 = h + (Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)) + ((e & f) ^ AndNot(e, g)) + Expand32(g_sha256_k[j]) + w[j]

%macro lab_loop_blk 0
; prefetchnta [rax*4]
movntdqa xmm6, [data+rax*4]
paddd xmm6, g_4sha256_k[rax*4]
add rax, 4

paddd xmm6, xmm10 ; +h

movdqa xmm1, xmm0
; movdqa xmm2, xmm9 ; It's redundant unless xmm9 becomes a destination
movdqa xmm10, xmm9 ; h = g  Changed from xmm2 to xmm9
movdqa xmm9, xmm8 ; f
movdqa xmm2, xmm8 ; g = f xmm9 became a destination but not until xmm2 was already used and replaced

pand xmm2, xmm0 ; e & f
pandn xmm1, xmm10 ; ~e & g Changed from xmm2 to xmm9 (see above reason) then xmm10 to combine writes
pxor xmm1, xmm2 ; (e & f) ^ (~e & g)
paddd xmm6, xmm1 ; Ch + h + w[i] + k[i]

movdqa xmm1, xmm0
movdqa xmm2, xmm0
movdqa xmm8, xmm0 ; f = e Combining these three moves for processor hardware optimization
psrld xmm0, 6 ; The xmm2 from xmm0 movdqa used to be after this taking advantage of the r-rotate 6
psrld xmm2, 11 ; Changed from 5 to 11 after shoving the movdqa commands together
pslld xmm1, 7
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 14
psrld xmm2, 14
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 5
pxor xmm0, xmm1 ; Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)
paddd xmm6, xmm0 ; xmm6 = t1
paddd xmm3, xmm6 ; e = d+t1

movdqa xmm0, xmm3 ; d
movdqa xmm1, xmm5 ; =b
movdqa xmm2, xmm4 ; c
movdqa xmm3, xmm2 ; d = c
pand xmm2, xmm5 ; b & c
pand xmm4, xmm7 ; a & c
pand xmm1, xmm7 ; a & b
pxor xmm1, xmm4
pxor xmm1, xmm2 ; (a & c) ^ (a & d) ^ (c & d)
paddd xmm6, xmm1 ; t1 + ((a & c) ^ (a & d) ^ (c & d))

movdqa xmm4, xmm5 ; c = b
movdqa xmm5, xmm7 ; b = a
movdqa xmm2, xmm7
movdqa xmm1, xmm7
psrld xmm1, 13
psrld xmm7, 2
pslld xmm2, 10
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 9
psrld xmm1, 9
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 11
pxor xmm7, xmm2
paddd xmm7, xmm6 ; a = t1 + (Rotr32(a, 2) ^ Rotr32(a, 13) ^ Rotr32(a, 22)) + ((a & c) ^ (a & d) ^ (c & d));
%endmacro

%assign i 0
%rep    LAB_LOOP_UNROLL
        lab_loop_blk
%assign i i+1
%endrep

cmp rax, rcx
jb LAB_LOOP

; Finished the 64 rounds, calculate hash and save

movntdqa xmm1, [rdx]
pshufd xmm2, xmm1, 0x55
paddd xmm5, xmm2
pshufd xmm6, xmm1, 0xAA
paddd xmm4, xmm6
pshufd xmm11, xmm1, 0xFF
paddd xmm3, xmm11
pshufd xmm1, xmm1, 0
paddd xmm7, xmm1

movntdqa xmm1, [rdx+16]
pshufd xmm2, xmm1, 0x55
paddd xmm8, xmm2
pshufd xmm6, xmm1, 0xAA
paddd xmm9, xmm6
pshufd xmm11, xmm1, 0xFF
paddd xmm10, xmm11
pshufd xmm1, xmm1, 0
paddd xmm0, xmm1

movdqa [hash], xmm7
movdqa [hash+16], xmm5
movdqa [hash+32], xmm4
movdqa [hash+48], xmm3
movdqa [hash+64], xmm0
movdqa [hash+80], xmm8
movdqa [hash+96], xmm9
movdqa [hash+112], xmm10

LAB_RET:
pop rbx
ret
I've commented out some of the optimizations I've been playing around with so you can see what I've been trying.  It seemed like the prefetches actually slowed the code down for me.  AMD users might have different results.  Here, I've taken the liberty of even supplying the AMD users with the SSE2 code for ease of use.  I ended up leaving in the loop modifications I made just because I couldn't tell much difference honestly.  But I'm going to bed.
Code:
;; SHA-256 for X86-64 for Linux, based off of:

; (c) Ufasoft 2011 http://ufasoft.com mailto:[email protected]
; Version 2011
; This software is Public Domain

; Significant re-write/optimisation and reordering by,
; Neil Kettle
; Small modifications played around with by,
; Erick Couts II
; ~18% performance improvement

; SHA-256 CPU SSE cruncher for Bitcoin Miner

ALIGN 32
BITS 64

%define hash rdi
%define data rsi
%define init rdx

; 0 = (1024 - 256) (mod (LAB_CALC_UNROLL*LAB_CALC_PARA*16))
%define LAB_CALC_PARA 2
%define LAB_CALC_UNROLL 8

%define LAB_LOOP_UNROLL 8

extern g_4sha256_k

global CalcSha256_x64
; CalcSha256 hash(rdi), data(rsi), init(rdx)
CalcSha256_x64:

push rbx

LAB_NEXT_NONCE:

mov rcx, 256 ; 256 - rcx is # of SHA-2 rounds
; mov rax, 64 ; 64 - rax is where we expand to

LAB_SHA:
push rcx
lea rcx, qword [data+1024] ; + 1024
lea r11, qword [data+256] ; + 256

LAB_CALC:
%macro lab_calc_blk 1
; prefetcht0 [r11-(15-%1)*16]
; prefetcht0 [r11-(15-(%1+1))*16]

movdqa xmm0, [r11-(15-%1)*16] ; xmm0 = W[I-15]
movdqa xmm1, xmm0 ; xmm1 = W[I-15]
movdqa xmm2, xmm0 ; xmm2 = W[I-15]
movdqa xmm3, [r11-(2-%1)*16] ; xmm3 = W[I-2]
movdqa xmm4, [r11-(15-(%1+1))*16] ; xmm4 = W[I-15+1]
movdqa xmm5, xmm4 ; xmm5 = W[I-15+1]
movdqa xmm6, xmm4 ; xmm6 = W[I-15+1]
movdqa xmm7, [r11-(2-(%1+1))*16] ; xmm7 = W[I-2+1]

; movdqa xmm2, xmm0 ; xmm2 = W[I-15]
; movdqa xmm6, xmm4 ; xmm6 = W[I-15+1]

psrld xmm0, 3 ; xmm0 = W[I-15] >> 3
psrld xmm1, 7 ; xmm1 = W[I-15] >> 7 (Moved and made it independent of xmm0)
psrld xmm4, 3 ; xmm4 = W[I-15+1] >> 3
psrld xmm5, 7 ; xmm5 = W[I-15+1] >> 7
pslld xmm2, 14 ; xmm2 = W[I-15] << 14

; movdqa xmm5, xmm4 ; xmm5 = W[I-15+1] >> 3
; movdqa xmm1, xmm0 ; xmm1 = W[I-15] >> 3

pxor xmm4, xmm5 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7)
pslld xmm6, 14 ; xmm6 = W[I-15+1] << 14

pxor xmm0, xmm1 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7)
pxor xmm0, xmm2 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14)
psrld xmm1, 11 ; xmm1 = W[I-15] >> 18
psrld xmm5, 11 ; xmm5 = W[I-15+1] >> 18
pxor xmm4, xmm6 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14)
pxor xmm4, xmm5 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18)
pslld xmm2, 11 ; xmm2 = W[I-15] << 25
pslld xmm6, 11 ; xmm6 = W[I-15+1] << 25
pxor xmm4, xmm6 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18) ^ (W[I-15+1] << 25)
pxor xmm0, xmm1 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18)
pxor xmm0, xmm2 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18) ^ (W[I-15] << 25)
paddd xmm0, [r11-(16-%1)*16] ; xmm0 = s0(W[I-15]) + W[I-16]
paddd xmm4, [r11-(16-(%1+1))*16] ; xmm4 = s0(W[I-15+1]) + W[I-16+1]


;;;;;;;;;;;;;;;;;;

movdqa xmm2, xmm3 ; xmm2 = W[I-2]
movdqa xmm1, xmm3 ; xmm1 = W[I-2] >> 10
movdqa xmm6, xmm7 ; xmm6 = W[I-2+1]
movdqa xmm5, xmm7 ; xmm5 = W[I-2+1] >> 10


paddd xmm0, [r11-(7-%1)*16] ; xmm0 = s0(W[I-15]) + W[I-16] + W[I-7]
paddd xmm4, [r11-(7-(%1+1))*16] ; xmm4 = s0(W[I-15+1]) + W[I-16+1] + W[I-7+1]

psrld xmm1, 17 ; xmm1 = W[I-2] >> 17
psrld xmm3, 10 ; xmm3 = W[I-2] >> 10
psrld xmm5, 17 ; xmm5 = W[I-2+1] >> 17
psrld xmm7, 10 ; xmm7 = W[I-2+1] >> 10
pslld xmm2, 13 ; xmm2 = W[I-2] << 13
pslld xmm6, 13 ; xmm6 = W[I-2+1] << 13

pxor xmm3, xmm1 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17)
pxor xmm3, xmm2 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13)
pxor xmm7, xmm5 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17)
pxor xmm7, xmm6 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13)

psrld xmm1, 2 ; xmm1 = W[I-2] >> 19
psrld xmm5, 2 ; xmm5 = W[I-2+1] >> 19
pslld xmm2, 2 ; xmm2 = W[I-2] << 15
pslld xmm6, 2 ; xmm6 = W[I-2+1] << 15

pxor xmm3, xmm1 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19)
pxor xmm3, xmm2 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19) ^ (W[I-2] << 15)
pxor xmm7, xmm5 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19)
pxor xmm7, xmm6 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19) ^ (W[I-2+1] << 15)
paddd xmm0, xmm3 ; xmm0 = s0(W[I-15]) + W[I-16] + s1(W[I-2]) + W[I-7]
paddd xmm4, xmm7 ; xmm4 = s0(W[I-15+1]) + W[I-16+1] + s1(W[I-2+1]) + W[I-7+1]

movdqa [r11+((%1+1)*16)], xmm4
movdqa [r11+(%1*16)], xmm0
%endmacro

%assign i 0
%rep    LAB_CALC_UNROLL
        lab_calc_blk i
%assign i i+LAB_CALC_PARA
%endrep
; prefetchnta [rcx]

add r11, LAB_CALC_UNROLL*LAB_CALC_PARA*16
cmp r11, rcx
jb LAB_CALC
prefetchnta [init+16]
pop rcx
mov rax, 0

; Load the init values of the message into the hash.

movdqa xmm7, [init]
movdqa xmm0, [init+16]
pshufd xmm5, xmm7, 0x55 ; xmm5 == b
pshufd xmm4, xmm7, 0xAA ; xmm4 == c
pshufd xmm3, xmm7, 0xFF ; xmm3 == d
pshufd xmm7, xmm7, 0 ; xmm7 == a
pshufd xmm8, xmm0, 0x55 ; xmm8 == f
pshufd xmm9, xmm0, 0xAA ; xmm9 == g
pshufd xmm10, xmm0, 0xFF ; xmm10 == h
pshufd xmm0, xmm0, 0 ; xmm0 == e

LAB_LOOP:

;; T t1 = h + (Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)) + ((e & f) ^ AndNot(e, g)) + Expand32(g_sha256_k[j]) + w[j]

%macro lab_loop_blk 0
; prefetchnta [rax*4]
movdqa xmm6, [data+rax*4]
paddd xmm6, g_4sha256_k[rax*4]
add rax, 4

paddd xmm6, xmm10 ; +h

movdqa xmm1, xmm0
; movdqa xmm2, xmm9 ; It's redundant unless xmm9 becomes a destination
movdqa xmm10, xmm9 ; h = g  Changed from xmm2 to xmm9
movdqa xmm9, xmm8 ; f
movdqa xmm2, xmm8 ; g = f xmm9 became a destination but not until xmm2 was already used and replaced

pand xmm2, xmm0 ; e & f
pandn xmm1, xmm10 ; ~e & g Changed from xmm2 to xmm9 (see above reason) then xmm10 to combine writes
pxor xmm1, xmm2 ; (e & f) ^ (~e & g)
paddd xmm6, xmm1 ; Ch + h + w[i] + k[i]

movdqa xmm1, xmm0
movdqa xmm2, xmm0
movdqa xmm8, xmm0 ; f = e Combining these three moves for processor hardware optimization
psrld xmm0, 6 ; The xmm2 from xmm0 movdqa used to be after this taking advantage of the r-rotate 6
psrld xmm2, 11 ; Changed from 5 to 11 after shoving the movdqa commands together
pslld xmm1, 7
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 14
psrld xmm2, 14
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 5
pxor xmm0, xmm1 ; Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)
paddd xmm6, xmm0 ; xmm6 = t1
paddd xmm3, xmm6 ; e = d+t1

movdqa xmm0, xmm3 ; d
movdqa xmm1, xmm5 ; =b
movdqa xmm2, xmm4 ; c
movdqa xmm3, xmm2 ; d = c
pand xmm2, xmm5 ; b & c
pand xmm4, xmm7 ; a & c
pand xmm1, xmm7 ; a & b
pxor xmm1, xmm4
pxor xmm1, xmm2 ; (a & c) ^ (a & d) ^ (c & d)
paddd xmm6, xmm1 ; t1 + ((a & c) ^ (a & d) ^ (c & d))

movdqa xmm4, xmm5 ; c = b
movdqa xmm5, xmm7 ; b = a
movdqa xmm2, xmm7
movdqa xmm1, xmm7
psrld xmm1, 13
psrld xmm7, 2
pslld xmm2, 10
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 9
psrld xmm1, 9
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 11
pxor xmm7, xmm2
paddd xmm7, xmm6 ; a = t1 + (Rotr32(a, 2) ^ Rotr32(a, 13) ^ Rotr32(a, 22)) + ((a & c) ^ (a & d) ^ (c & d));
%endmacro

%assign i 0
%rep    LAB_LOOP_UNROLL
        lab_loop_blk
%assign i i+1
%endrep

cmp rax, rcx
jb LAB_LOOP

; Finished the 64 rounds, calculate hash and save

movdqa xmm1, [rdx]
pshufd xmm2, xmm1, 0x55
paddd xmm5, xmm2
pshufd xmm6, xmm1, 0xAA
paddd xmm4, xmm6
pshufd xmm11, xmm1, 0xFF
paddd xmm3, xmm11
pshufd xmm1, xmm1, 0
paddd xmm7, xmm1

movdqa xmm1, [rdx+16]
pshufd xmm2, xmm1, 0x55
paddd xmm8, xmm2
pshufd xmm6, xmm1, 0xAA
paddd xmm9, xmm6
pshufd xmm11, xmm1, 0xFF
paddd xmm10, xmm11
pshufd xmm1, xmm1, 0
paddd xmm0, xmm1

movdqa [hash], xmm7
movdqa [hash+16], xmm5
movdqa [hash+32], xmm4
movdqa [hash+48], xmm3
movdqa [hash+64], xmm0
movdqa [hash+80], xmm8
movdqa [hash+96], xmm9
movdqa [hash+112], xmm10

LAB_RET:
pop rbx
ret
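For anyone trying to follow the shift soup inside LAB_LOOP: SSE2 has no packed 32-bit rotate, so each Rotr32 has to be built out of a right shift, a left shift and a combine. A minimal sketch of a single rotate (registers arbitrary, not taken from the listing above):
Code:
; Rotr32(x, 6) == (x >> 6) | (x << 26)
movdqa xmm1, xmm0    ; keep a copy of x
psrld  xmm0, 6       ; x >> 6
pslld  xmm1, 26      ; x << 26
por    xmm0, xmm1    ; the two halves share no bits, so OR (or XOR) joins them
The loop above never forms the individual rotates; it XORs all six shifted pieces of Rotr32(e,6) ^ Rotr32(e,11) ^ Rotr32(e,25) together, bumping the shift counts as it goes, which comes out the same because the two halves of each rotate never overlap.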
newbie
Activity: 59
Merit: 0
d3m0n1q_733rz, why wouldn't you create a fork on github? It would be easier than copy-pasting and less error-prone.
sr. member
Activity: 378
Merit: 250
Code:
;; SHA-256 for X86-64 for Linux, based off of:

; (c) Ufasoft 2011 http://ufasoft.com mailto:[email protected]
; Version 2011
; This software is Public Domain

; Significant re-write/optimisation and reordering by,
; Neil Kettle
; ~18% performance improvement

; SHA-256 CPU SSE cruncher for Bitcoin Miner

ALIGN 32
BITS 64

%define hash rdi
%define data rsi
%define init rdx

; 0 = (1024 - 256) (mod (LAB_CALC_UNROLL*LAB_CALC_PARA*16))
%define LAB_CALC_PARA 2
%define LAB_CALC_UNROLL 8

%define LAB_LOOP_UNROLL 8

extern g_4sha256_k

global CalcSha256_x64_sse4
; CalcSha256 hash(rdi), data(rsi), init(rdx)
CalcSha256_x64_sse4:

push rbx

LAB_NEXT_NONCE:

mov rcx, 256 ; 256 - rcx is # of SHA-2 rounds
; mov rax, 64 ; 64 - rax is where we expand to

LAB_SHA:
push rcx
lea rcx, qword [data+1024] ; + 1024
lea r11, qword [data+256] ; + 256

LAB_CALC:
%macro lab_calc_blk 1

movntdqa xmm0, [r11-(15-%1)*16] ; xmm0 = W[I-15]
movntdqa xmm1, [r11-(15-%1)*16] ; xmm1 = W[I-15]
movntdqa xmm2, [r11-(15-%1)*16] ; xmm2 = W[I-15]
movntdqa xmm3, [r11-(2-%1)*16] ; xmm3 = W[I-2]
movntdqa xmm4, [r11-(15-(%1+1))*16] ; xmm4 = W[I-15+1]
movntdqa xmm5, [r11-(15-(%1+1))*16] ; xmm5 = W[I-15+1]
movntdqa xmm6, [r11-(15-(%1+1))*16] ; xmm6 = W[I-15+1]
movntdqa xmm7, [r11-(2-(%1+1))*16] ; xmm7 = W[I-2+1]

; movdqa xmm2, xmm0 ; xmm2 = W[I-15]
; movdqa xmm6, xmm4 ; xmm6 = W[I-15+1]

psrld xmm0, 3 ; xmm0 = W[I-15] >> 3
psrld xmm1, 7 ; xmm1 = W[I-15] >> 7 Moved and made it independent of xmm0
psrld xmm4, 3 ; xmm4 = W[I-15+1] >> 3
psrld xmm5, 7 ; xmm5 = W[I-15+1] >> 7
pslld xmm2, 14 ; xmm2 = W[I-15] << 14

; movdqa xmm5, xmm4 ; xmm5 = W[I-15+1] >> 3
; movdqa xmm1, xmm0 ; xmm1 = W[I-15] >> 3

pxor xmm4, xmm5 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7)
pslld xmm6, 14 ; xmm6 = W[I-15+1] << 14

pxor xmm0, xmm1 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7)
pxor xmm0, xmm2 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14)
psrld xmm1, 11 ; xmm1 = W[I-15] >> 18
psrld xmm5, 11 ; xmm5 = W[I-15+1] >> 18
pxor xmm4, xmm6 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14)
pxor xmm4, xmm5 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18)
pslld xmm2, 11 ; xmm2 = W[I-15] << 25
pslld xmm6, 11 ; xmm6 = W[I-15+1] << 25
pxor xmm4, xmm6 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18) ^ (W[I-15+1] << 25)
pxor xmm0, xmm1 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18)
pxor xmm0, xmm2 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18) ^ (W[I-15] << 25)
paddd xmm0, [r11-(16-%1)*16] ; xmm0 = s0(W[I-15]) + W[I-16]
paddd xmm4, [r11-(16-(%1+1))*16] ; xmm4 = s0(W[I-15+1]) + W[I-16+1]


;;;;;;;;;;;;;;;;;;

movdqa xmm2, xmm3 ; xmm2 = W[I-2]
psrld xmm3, 10 ; xmm3 = W[I-2] >> 10
movdqa xmm1, xmm3 ; xmm1 = W[I-2] >> 10
movdqa xmm6, xmm7 ; xmm6 = W[I-2+1]
psrld xmm7, 10 ; xmm7 = W[I-2+1] >> 10
movdqa xmm5, xmm7 ; xmm5 = W[I-2+1] >> 10

paddd xmm0, [r11-(7-%1)*16] ; xmm0 = s0(W[I-15]) + W[I-16] + W[I-7]
paddd xmm4, [r11-(7-(%1+1))*16] ; xmm4 = s0(W[I-15+1]) + W[I-16+1] + W[I-7+1]

pslld xmm2, 13 ; xmm2 = W[I-2] << 13
pslld xmm6, 13 ; xmm6 = W[I-2+1] << 13
psrld xmm1, 7 ; xmm1 = W[I-2] >> 17
psrld xmm5, 7 ; xmm5 = W[I-2+1] >> 17



pxor xmm3, xmm1 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17)
psrld xmm1, 2 ; xmm1 = W[I-2] >> 19
pxor xmm3, xmm2 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13)
pslld xmm2, 2 ; xmm2 = W[I-2] << 15
pxor xmm7, xmm5 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17)
psrld xmm5, 2 ; xmm5 = W[I-2+1] >> 19
pxor xmm7, xmm6 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13)
pslld xmm6, 2 ; xmm6 = W[I-2+1] << 15



pxor xmm3, xmm1 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19)
pxor xmm3, xmm2 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19) ^ (W[I-2] << 15)
paddd xmm0, xmm3 ; xmm0 = s0(W[I-15]) + W[I-16] + s1(W[I-2]) + W[I-7]
pxor xmm7, xmm5 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19)
pxor xmm7, xmm6 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19) ^ (W[I-2+1] << 15)
paddd xmm4, xmm7 ; xmm4 = s0(W[I-15+1]) + W[I-16+1] + s1(W[I-2+1]) + W[I-7+1]

movdqa [r11+(%1*16)], xmm0
movdqa [r11+((%1+1)*16)], xmm4
%endmacro

%assign i 0
%rep    LAB_CALC_UNROLL
        lab_calc_blk i
%assign i i+LAB_CALC_PARA
%endrep

add r11, LAB_CALC_UNROLL*LAB_CALC_PARA*16
cmp r11, rcx
jb LAB_CALC

pop rcx
mov rax, 0

; Load the init values of the message into the hash.

movntdqa xmm7, [init]
movntdqa xmm0, [init+16]
pshufd xmm5, xmm7, 0x55 ; xmm5 == b
pshufd xmm4, xmm7, 0xAA ; xmm4 == c
pshufd xmm3, xmm7, 0xFF ; xmm3 == d
pshufd xmm7, xmm7, 0 ; xmm7 == a
pshufd xmm8, xmm0, 0x55 ; xmm8 == f
pshufd xmm9, xmm0, 0xAA ; xmm9 == g
pshufd xmm10, xmm0, 0xFF ; xmm10 == h
pshufd xmm0, xmm0, 0 ; xmm0 == e

LAB_LOOP:

;; T t1 = h + (Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)) + ((e & f) ^ AndNot(e, g)) + Expand32(g_sha256_k[j]) + w[j]

%macro lab_loop_blk 0 ; Notice the macro! rax*4 isn't redundant here.
movntdqa xmm6, [data+rax*4]
paddd xmm6, g_4sha256_k[rax*4]
add rax, 4

paddd xmm6, xmm10 ; +h

movdqa xmm1, xmm0
; movdqa xmm2, xmm9 ; It's redundant unless xmm9 becomes a destination
movdqa xmm10, xmm9 ; h = g  Changed from xmm2 to xmm9
pandn xmm1, xmm9 ; ~e & g Changed from xmm2 to xmm9

movdqa xmm9, xmm8 ; f
movdqa xmm2, xmm8 ; g = f xmm9 became a destination but not until xmm2 was already used and replaced

pand xmm2, xmm0 ; e & f
pxor xmm1, xmm2 ; (e & f) ^ (~e & g)
paddd xmm6, xmm1 ; Ch + h + w[i] + k[i]

movdqa xmm1, xmm0
movdqa xmm2, xmm0
movdqa xmm8, xmm0 ; f = e Combining these three moves for processor hardware optimization
psrld xmm0, 6 ; The xmm2 from xmm0 move used to be after this taking advantage of the r-rotate 6
psrld xmm2, 11 ; Changed from 5 to 11 after shoving the movdqa commands together
pslld xmm1, 7
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 14
psrld xmm2, 14
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 5
pxor xmm0, xmm1 ; Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)
paddd xmm6, xmm0 ; xmm6 = t1
paddd xmm3, xmm6 ; e = d+t1

movdqa xmm0, xmm3 ; d
movdqa xmm1, xmm5 ; =b
movdqa xmm2, xmm4 ; c
movdqa xmm3, xmm2 ; d = c
pand xmm2, xmm5 ; b & c
pand xmm4, xmm7 ; a & c
pand xmm1, xmm7 ; a & b
pxor xmm1, xmm4
pxor xmm1, xmm2 ; (a & c) ^ (a & d) ^ (c & d)
paddd xmm6, xmm1 ; t1 + ((a & c) ^ (a & d) ^ (c & d))

movdqa xmm4, xmm5 ; c = b
movdqa xmm5, xmm7 ; b = a
movdqa xmm2, xmm7
movdqa xmm1, xmm7
psrld xmm7, 2
pslld xmm2, 10
psrld xmm1, 13
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 9
psrld xmm1, 9
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 11
pxor xmm7, xmm2
paddd xmm7, xmm6 ; a = t1 + (Rotr32(a, 2) ^ Rotr32(a, 13) ^ Rotr32(a, 22)) + ((a & c) ^ (a & d) ^ (c & d));
%endmacro

%assign i 0
%rep    LAB_LOOP_UNROLL
        lab_loop_blk
%assign i i+1
%endrep

cmp rax, rcx
jb LAB_LOOP

; Finished the 64 rounds, calculate hash and save

movntdqa xmm1, [rdx]
pshufd xmm2, xmm1, 0x55
paddd xmm5, xmm2
pshufd xmm6, xmm1, 0xAA
paddd xmm4, xmm6
pshufd xmm11, xmm1, 0xFF
paddd xmm3, xmm11
pshufd xmm1, xmm1, 0
paddd xmm7, xmm1

movntdqa xmm1, [rdx+16]
pshufd xmm2, xmm1, 0x55
paddd xmm8, xmm2
pshufd xmm6, xmm1, 0xAA
paddd xmm9, xmm6
pshufd xmm11, xmm1, 0xFF
paddd xmm10, xmm11
pshufd xmm1, xmm1, 0
paddd xmm0, xmm1

movdqa [hash], xmm7
movdqa [hash+16], xmm5
movdqa [hash+32], xmm4
movdqa [hash+48], xmm3
movdqa [hash+64], xmm0
movdqa [hash+80], xmm8
movdqa [hash+96], xmm9
movdqa [hash+112], xmm10

LAB_RET:
pop rbx
ret

SSE4 so far.  I'm taking a break to watch anime.   Cheesy
The changes take advantage of write-combining hardware. If you have it, great; if you don't, you won't notice much of a change. You probably won't notice much anyway, since the basic code structure is the same. Eh, oh well.
Edit:  Slight slow-down in the lab-loop.  I'll copy-paste the old code back in to fix it later.  O_O  Bleach is on!
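If I'm reading the intent right, the "write combining" part is simply keeping the two 16-byte stores of W[i] and W[i+1] back-to-back and in ascending address order, so they hit the same (or the adjacent) cache line while it's still hot. A minimal sketch, using the same offsets as the macro above:
Code:
; Two consecutive 16-byte stores to adjacent addresses, issued
; back-to-back so the store buffer can queue them together:
movdqa [r11+(%1*16)], xmm0       ; W[i]
movdqa [r11+((%1+1)*16)], xmm4   ; W[i+1], the next 16 bytes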
legendary
Activity: 2450
Merit: 1002
Any chance of an updated win32 build sometime? It's still at v1.5.1...
sr. member
Activity: 378
Merit: 250
Hey, I've been working on the hashing asm, as I said before: removing redundant operations and register moves, using logic to change sources and destinations to take advantage of processor hardware optimizations, and doing some of the easy math myself so the processor doesn't have to. Here's what I've done so far. It's not much, but it works. Don't go changing the github source just yet though. For now, copy-paste this to replace your existing sha256_sse4_amd64.asm file. For those of you without SSE4.1 (such as AMD users), copy-paste this into your sse2_amd64 file instead and search-replace all uses of movntdqa with movdqa so the quick memory moves aren't used.


I'll be attacking the LAB_LOOP next.

Where is the sse2_amd64 file for AMD users located?

In /x86_64 there is only:

sha256_sse4_amd64.asm
sha256_xmm_amd64.asm

I don't see anywhere:
sha256_sse2_amd64.asm

Oops, sorry. The xmm version is what I meant. I keep thinking of them as sse2 and sse4 for my own peace of mind and to keep the instruction sets straight. I'm looking for places to implement SSE3 instructions to run math calculations on dwords simultaneously, but I would have to restructure the entire program to take advantage of it, and even then I'm not sure if it would work better or worse.

AMD phenom X6

sse2              17.4 MHash/s
fixed sse2      18.0 MHash/s
4way              20.4 MHash/s
sse4               illegal instruction


edit:
4way works in 1.4 and below.
In 1.5 and up it works too (same speed), but everything gets rejected.

I'll take a look at 4-way for you later tonight. The SSE4 illegal instruction would be movntdqa, the non-temporal move of double-quadwords from memory into the xmm registers, which AMD processors don't support as of yet. I'm looking for a way to break the data down to doublewords so that I can take advantage of AMD's movntdd command to achieve the same thing, but it's more difficult than it sounds. I'm actually surprised that 4way is working faster for you than SSE2. Is this all after the threads normalized?
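For reference, the illegal instruction is almost certainly movntdqa itself: it's the streaming load added in SSE4.1, and Phenoms stop at SSE4a. The SSE2-safe substitution described earlier is just swapping it for the ordinary aligned load; a minimal before/after sketch using one of the loads from the macro:
Code:
; SSE4.1 streaming load -- faults with "illegal instruction" on CPUs
; without SSE4.1:
movntdqa xmm0, [r11-(15-%1)*16]   ; xmm0 = W[I-15]

; SSE2 equivalent -- same aligned 16-byte load, minus the
; non-temporal hint:
movdqa   xmm0, [r11-(15-%1)*16]   ; xmm0 = W[I-15]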
sr. member
Activity: 378
Merit: 250
Great work ckolivas!

It looks like this will become the miner of choice with all the slick features you are adding.

Could you look at adding an option for monitoring the GPU temperature and backing off when it hits a maximum value and not resuming until it hits another minimum value?

If you included this you would be negating the need to ever use anything else imo Smiley
You know, technically, that feature should be maintained by the GPU itself.  But I know that ufasoft has implemented it for some reason.  It's more of a safeguard against failure of the hardware's throttle.  Alternatively, you could try adjusting the fan speed of your card using free software so that the fan spins up at higher temps.  That could help keep it from reaching that temperature in the first place.
c_k
donator
Activity: 242
Merit: 100
Great work ckolivas!

It looks like this will become the miner of choice with all the slick features you are adding.

Could you look at adding an option for monitoring the GPU temperature and backing off when it hits a maximum value and not resuming until it hits another minimum value?

If you included this you would be negating the need to ever use anything else imo Smiley
newbie
Activity: 56
Merit: 0
Hey, I've been working on the hashing asm, as I said before: removing redundant operations and register moves, using logic to change sources and destinations to take advantage of processor hardware optimizations, and doing some of the easy math myself so the processor doesn't have to. Here's what I've done so far. It's not much, but it works. Don't go changing the github source just yet though. For now, copy-paste this to replace your existing sha256_sse4_amd64.asm file. For those of you without SSE4.1 (such as AMD users), copy-paste this into your sse2_amd64 file instead and search-replace all uses of movntdqa with movdqa so the quick memory moves aren't used.


I'll be attacking the LAB_LOOP next.

Where is the sse2_amd64 file for AMD users located?

In /x86_64 there is only:

sha256_sse4_amd64.asm
sha256_xmm_amd64.asm

I don't see anywhere:
sha256_sse2_amd64.asm

Oops, sorry. The xmm version is what I meant. I keep thinking of them as sse2 and sse4 for my own peace of mind and to keep the instruction sets straight. I'm looking for places to implement SSE3 instructions to run math calculations on dwords simultaneously, but I would have to restructure the entire program to take advantage of it, and even then I'm not sure if it would work better or worse.

AMD phenom X6

sse2              17.4 MHash/s
fixed sse2      18.0 MHash/s
4way              20.4 MHash/s
sse4               illegal instruction


edit:
4way works in 1.4 and below.
In 1.5 and up it works too (same speed), but everything gets rejected.
sr. member
Activity: 378
Merit: 250
Hey, I've been working on the hashing asm, as I said before: removing redundant operations and register moves, using logic to change sources and destinations to take advantage of processor hardware optimizations, and doing some of the easy math myself so the processor doesn't have to. Here's what I've done so far. It's not much, but it works. Don't go changing the github source just yet though. For now, copy-paste this to replace your existing sha256_sse4_amd64.asm file. For those of you without SSE4.1 (such as AMD users), copy-paste this into your sse2_amd64 file instead and search-replace all uses of movntdqa with movdqa so the quick memory moves aren't used.


I'll be attacking the LAB_LOOP next.

Where is the sse2_amd64 file for AMD users located?

In /x86_64 there is only:

sha256_sse4_amd64.asm
sha256_xmm_amd64.asm

I don't see anywhere:
sha256_sse2_amd64.asm





Oops, sorry. The xmm version is what I meant. I keep thinking of them as sse2 and sse4 for my own peace of mind and to keep the instruction sets straight. I'm looking for places to implement SSE3 instructions to run math calculations on dwords simultaneously, but I would have to restructure the entire program to take advantage of it, and even then I'm not sure if it would work better or worse.
sr. member
Activity: 378
Merit: 250
Hey, I've been working on the hashing asm, as I said before: removing redundant operations and register moves, using logic to change sources and destinations to take advantage of processor hardware optimizations, and doing some of the easy math myself so the processor doesn't have to. Here's what I've done so far. It's not much, but it works. Don't go changing the github source just yet though. For now, copy-paste this to replace your existing sha256_sse4_amd64.asm file. For those of you without SSE4.1 (such as AMD users), copy-paste this into your sse2_amd64 file instead and search-replace all uses of movntdqa with movdqa so the quick memory moves aren't used.

I pasted your ASM into sha256_xmm_amd64.asm and changed "movntdqa" to "movdqa" like you said for sse2. But I get a linker error.
Code:
...
cgminer-sha256_sse2_amd64.o: In function `scanhash_sse2_64':
sha256_sse2_amd64.c:(.text+0x4fb): undefined reference to `CalcSha256_x64'
sha256_sse2_amd64.c:(.text+0x50b): undefined reference to `CalcSha256_x64'
collect2: ld returned 1 exit status
...

I had to change "CalcSha256_x64_sse4" to "CalcSha256_x64" in two spots. Then the compile went just fine. I'm running now to see if it's any faster and if any work actually gets accepted, but hopefully it's bug-free.

btw, doesn't the assembler do basic inline math before assembling?

P.S. Hashrate looks really close to the same but I did get a work unit accepted just now.

EDIT: so the increase in speed, if any, is around 1%, maybe slightly more. I only have two cores at 3.5 Mh/s each, so it's hard to see the difference on the scale of Mhash/s.

Admittedly, there won't be much of a speed improvement just yet, as I haven't really gone after the main loop.  The vast majority of changes I've made apply to just before the work is inserted into the loop.  Also, leaving things to the assembler with the assumption that it will do them tends to leave room for problems; sometimes a change you think will take place doesn't, and it ends up adding to the CPU instructions to calculate.  It's often best to head that off beforehand.
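One way to avoid touching the C side at all (a sketch, not tested against the cgminer build) is to export the routine under both names in the .asm, so it links whether the wrapper expects CalcSha256_x64 or CalcSha256_x64_sse4:
Code:
; Two labels at the same address, both exported -- either symbol
; resolves to the same routine at link time.
global CalcSha256_x64
global CalcSha256_x64_sse4

CalcSha256_x64:
CalcSha256_x64_sse4:
push rbx
; ... rest of the routine unchanged ...
And on the inline-math question: NASM does fold constant expressions like (15-%1)*16 at assembly time, so those offsets cost nothing at run time; what it won't do is reorder instructions or drop redundant register moves for you.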
newbie
Activity: 11
Merit: 0
Hey, I've been working on the hashing asm, as I said before: removing redundant operations and register moves, using logic to change sources and destinations to take advantage of processor hardware optimizations, and doing some of the easy math myself so the processor doesn't have to. Here's what I've done so far. It's not much, but it works. Don't go changing the github source just yet though. For now, copy-paste this to replace your existing sha256_sse4_amd64.asm file. For those of you without SSE4.1 (such as AMD users), copy-paste this into your sse2_amd64 file instead and search-replace all uses of movntdqa with movdqa so the quick memory moves aren't used.

I pasted your ASM into sha256_xmm_amd64.asm and changed "movntdqa" to "movdqa" like you said for sse2. But I get a linker error.
Code:
...
cgminer-sha256_sse2_amd64.o: In function `scanhash_sse2_64':
sha256_sse2_amd64.c:(.text+0x4fb): undefined reference to `CalcSha256_x64'
sha256_sse2_amd64.c:(.text+0x50b): undefined reference to `CalcSha256_x64'
collect2: ld returned 1 exit status
...

I had to change "CalcSha256_x64_sse4" to "CalcSha256_x64" in two spots. Then the compile went just fine. I'm running now to see if it's any faster and if any work actually gets accepted, but hopefully it's bug-free.

btw, doesn't the assembler do basic inline math before assembling?

P.S. Hashrate looks really close to the same but I did get a work unit accepted just now.

EDIT: so the increase in speed, if any, is around 1%, maybe slightly more. I only have two cores at 3.5 Mh/s each, so it's hard to see the difference on the scale of Mhash/s.
newbie
Activity: 56
Merit: 0
Hey, I've been working on the hashing asm, as I said before: removing redundant operations and register moves, using logic to change sources and destinations to take advantage of processor hardware optimizations, and doing some of the easy math myself so the processor doesn't have to. Here's what I've done so far. It's not much, but it works. Don't go changing the github source just yet though. For now, copy-paste this to replace your existing sha256_sse4_amd64.asm file. For those of you without SSE4.1 (such as AMD users), copy-paste this into your sse2_amd64 file instead and search-replace all uses of movntdqa with movdqa so the quick memory moves aren't used.


I'll be attacking the LAB_LOOP next.

Where is the sse2_amd64 file for AMD users located?

In /x86_64 there is only:

sha256_sse4_amd64.asm
sha256_xmm_amd64.asm

I don't see anywhere:
sha256_sse2_amd64.asm




member
Activity: 145
Merit: 10
Nice contribution...

Still trying to get 64-bit builds working on Win7 64-bit...
sr. member
Activity: 378
Merit: 250
Hey, I've been working on the hashing asm, as I said before: removing redundant operations and register moves, using logic to change sources and destinations to take advantage of processor hardware optimizations, and doing some of the easy math myself so the processor doesn't have to. Here's what I've done so far. It's not much, but it works. Don't go changing the github source just yet though. For now, copy-paste this to replace your existing sha256_sse4_amd64.asm file. For those of you without SSE4.1 (such as AMD users), copy-paste this into your sse2_amd64 file instead and search-replace all uses of movntdqa with movdqa so the quick memory moves aren't used.

So here it is:
Code:
;; SHA-256 for X86-64 for Linux, based off of:

; (c) Ufasoft 2011 http://ufasoft.com mailto:[email protected]
; Version 2011
; This software is Public Domain

; Significant re-write/optimisation and reordering by,
; Neil Kettle
; ~18% performance improvement

; SHA-256 CPU SSE cruncher for Bitcoin Miner

ALIGN 32
BITS 64

%define hash rdi
%define data rsi
%define init rdx

; 0 = (1024 - 256) (mod (LAB_CALC_UNROLL*LAB_CALC_PARA*16))
%define LAB_CALC_PARA 2
%define LAB_CALC_UNROLL 8

%define LAB_LOOP_UNROLL 8

extern g_4sha256_k

global CalcSha256_x64_sse4
; CalcSha256 hash(rdi), data(rsi), init(rdx)
CalcSha256_x64_sse4:

push rbx

LAB_NEXT_NONCE:

mov rcx, 256 ; 256 - rcx is # of SHA-2 rounds
; mov rax, 64 ; 64 - rax is where we expand to

LAB_SHA:
push rcx
lea rcx, qword [data+(1024)] ; + 1024
lea r11, qword [data+(256)] ; + 256

LAB_CALC:
%macro lab_calc_blk 1

movntdqa xmm0, [r11-(15-%1)*16] ; xmm0 = W[I-15]
movntdqa xmm4, [r11-(15-(%1+1))*16] ; xmm4 = W[I-15+1]
movdqa xmm2, xmm0 ; xmm2 = W[I-15]
movdqa xmm6, xmm4 ; xmm6 = W[I-15+1]

psrld xmm0, 3 ; xmm0 = W[I-15] >> 3
movdqa xmm1, xmm0 ; xmm1 = W[I-15] >> 3
pslld xmm2, 14 ; xmm2 = W[I-15] << 14
psrld xmm4, 3 ; xmm4 = W[I-15+1] >> 3
movdqa xmm5, xmm4 ; xmm5 = W[I-15+1] >> 3
psrld xmm5, 4 ; xmm5 = W[I-15+1] >> 7
pxor xmm4, xmm5 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7)
pslld xmm6, 14 ; xmm6 = W[I-15+1] << 14
psrld xmm1, 4 ; xmm1 = W[I-15] >> 7
pxor xmm0, xmm1 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7)
pxor xmm0, xmm2 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14)
psrld xmm1, 11 ; xmm1 = W[I-15] >> 18
psrld xmm5, 11 ; xmm5 = W[I-15+1] >> 18
pxor xmm4, xmm6 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14)
pxor xmm4, xmm5 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18)
pslld xmm2, 11 ; xmm2 = W[I-15] << 25
pslld xmm6, 11 ; xmm6 = W[I-15+1] << 25
pxor xmm4, xmm6 ; xmm4 = (W[I-15+1] >> 3) ^ (W[I-15+1] >> 7) ^ (W[I-15+1] << 14) ^ (W[I-15+1] >> 18) ^ (W[I-15+1] << 25)
pxor xmm0, xmm1 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18)
pxor xmm0, xmm2 ; xmm0 = (W[I-15] >> 3) ^ (W[I-15] >> 7) ^ (W[I-15] << 14) ^ (W[I-15] >> 18) ^ (W[I-15] << 25)
paddd xmm0, [r11-(16-%1)*16] ; xmm0 = s0(W[I-15]) + W[I-16]
paddd xmm4, [r11-(16-(%1+1))*16] ; xmm4 = s0(W[I-15+1]) + W[I-16+1]
movntdqa xmm3, [r11-(2-%1)*16] ; xmm3 = W[I-2]
movntdqa xmm7, [r11-(2-(%1+1))*16] ; xmm7 = W[I-2+1]

;;;;;;;;;;;;;;;;;;

movdqa xmm2, xmm3 ; xmm2 = W[I-2]
psrld xmm3, 10 ; xmm3 = W[I-2] >> 10
movdqa xmm1, xmm3 ; xmm1 = W[I-2] >> 10
movdqa xmm6, xmm7 ; xmm6 = W[I-2+1]
psrld xmm7, 10 ; xmm7 = W[I-2+1] >> 10
movdqa xmm5, xmm7 ; xmm5 = W[I-2+1] >> 10

paddd xmm0, [r11-(7-%1)*16] ; xmm0 = s0(W[I-15]) + W[I-16] + W[I-7]
paddd xmm4, [r11-(7-(%1+1))*16] ; xmm4 = s0(W[I-15+1]) + W[I-16+1] + W[I-7+1]

pslld xmm2, 13 ; xmm2 = W[I-2] << 13
pslld xmm6, 13 ; xmm6 = W[I-2+1] << 13
psrld xmm1, 7 ; xmm1 = W[I-2] >> 17
psrld xmm5, 7 ; xmm5 = W[I-2+1] >> 17



pxor xmm3, xmm1 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17)
psrld xmm1, 2 ; xmm1 = W[I-2] >> 19
pxor xmm3, xmm2 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13)
pslld xmm2, 2 ; xmm2 = W[I-2] << 15
pxor xmm7, xmm5 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17)
psrld xmm5, 2 ; xmm5 = W[I-2+1] >> 19
pxor xmm7, xmm6 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13)
pslld xmm6, 2 ; xmm6 = W[I-2+1] << 15



pxor xmm3, xmm1 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19)
pxor xmm3, xmm2 ; xmm3 = (W[I-2] >> 10) ^ (W[I-2] >> 17) ^ (W[I-2] << 13) ^ (W[I-2] >> 19) ^ (W[I-2] << 15)
paddd xmm0, xmm3 ; xmm0 = s0(W[I-15]) + W[I-16] + s1(W[I-2]) + W[I-7]
pxor xmm7, xmm5 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19)
pxor xmm7, xmm6 ; xmm7 = (W[I-2+1] >> 10) ^ (W[I-2+1] >> 17) ^ (W[I-2+1] << 13) ^ (W[I-2+1] >> 19) ^ (W[I-2+1] << 15)
paddd xmm4, xmm7 ; xmm4 = s0(W[I-15+1]) + W[I-16+1] + s1(W[I-2+1]) + W[I-7+1]

movdqa [r11+(%1*16)], xmm0
movdqa [r11+((%1+1)*16)], xmm4
%endmacro

%assign i 0
%rep    LAB_CALC_UNROLL
        lab_calc_blk i
%assign i i+LAB_CALC_PARA
%endrep

add r11, LAB_CALC_UNROLL*LAB_CALC_PARA*16
cmp r11, rcx
jb LAB_CALC

pop rcx
mov rax, 0

; Load the init values of the message into the hash.

movntdqa xmm7, [init]
movntdqa xmm0, [init+16]
pshufd xmm5, xmm7, 0x55 ; xmm5 == b
pshufd xmm8, xmm0, 0x55 ; xmm8 == f
pshufd xmm4, xmm7, 0xAA ; xmm4 == c
pshufd xmm9, xmm0, 0xAA ; xmm9 == g
pshufd xmm3, xmm7, 0xFF ; xmm3 == d
pshufd xmm10, xmm0, 0xFF ; xmm10 == h
pshufd xmm7, xmm7, 0 ; xmm7 == a
pshufd xmm0, xmm0, 0 ; xmm0 == e

LAB_LOOP:

;; T t1 = h + (Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)) + ((e & f) ^ AndNot(e, g)) + Expand32(g_sha256_k[j]) + w[j]

%macro lab_loop_blk 0
movntdqa xmm6, [data+rax*4]
paddd xmm6, g_4sha256_k[rax*4]
add rax, 4

paddd xmm6, xmm10 ; +h

movdqa xmm1, xmm0
; movdqa xmm2, xmm9 ; It's redundant unless xmm9 becomes a destination
pandn xmm1, xmm9 ; ~e & g Changed from xmm2 to xmm9

movdqa xmm10, xmm9 ; h = g  Changed from xmm2 to xmm9
movdqa xmm9, xmm8 ; f
movdqa xmm2, xmm8 ; g = f xmm9 became a destination but not until xmm2 was already used and replaced

pand xmm2, xmm0 ; e & f
pxor xmm1, xmm2 ; (e & f) ^ (~e & g)
paddd xmm6, xmm1 ; Ch + h + w[i] + k[i]

movdqa xmm8, xmm0 ; f = e Combining these three moves for processor hardware optimization
movdqa xmm1, xmm0
movdqa xmm2, xmm0
psrld xmm0, 6 ; The xmm2 from xmm0 move used to be after this taking advantage of the r-rotate 6
pslld xmm1, 7
psrld xmm2, 11 ; Changed from 5 to 11 after shoving the movdqa commands together
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 14
psrld xmm2, 14
pxor xmm0, xmm1
pxor xmm0, xmm2
pslld xmm1, 5
pxor xmm0, xmm1 ; Rotr32(e, 6) ^ Rotr32(e, 11) ^ Rotr32(e, 25)
paddd xmm6, xmm0 ; xmm6 = t1

movdqa xmm0, xmm3 ; d
paddd xmm0, xmm6 ; e = d+t1

movdqa xmm1, xmm5 ; =b
movdqa xmm3, xmm4 ; d = c
movdqa xmm2, xmm4 ; c
pand xmm2, xmm5 ; b & c
pand xmm4, xmm7 ; a & c
pand xmm1, xmm7 ; a & b
pxor xmm1, xmm4
movdqa xmm4, xmm5 ; c = b
movdqa xmm5, xmm7 ; b = a
pxor xmm1, xmm2 ; (a & c) ^ (a & d) ^ (c & d)
paddd xmm6, xmm1 ; t1 + ((a & c) ^ (a & d) ^ (c & d))

movdqa xmm2, xmm7
psrld xmm7, 2
movdqa xmm1, xmm7
pslld xmm2, 10
psrld xmm1, 11
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 9
psrld xmm1, 9
pxor xmm7, xmm2
pxor xmm7, xmm1
pslld xmm2, 11
pxor xmm7, xmm2
paddd xmm7, xmm6 ; a = t1 + (Rotr32(a, 2) ^ Rotr32(a, 13) ^ Rotr32(a, 22)) + ((a & c) ^ (a & d) ^ (c & d));
%endmacro

%assign i 0
%rep    LAB_LOOP_UNROLL
        lab_loop_blk
%assign i i+1
%endrep

cmp rax, rcx
jb LAB_LOOP

; Finished the 64 rounds, calculate hash and save

movntdqa xmm1, [rdx]
pshufd xmm2, xmm1, 0x55
paddd xmm5, xmm2
pshufd xmm6, xmm1, 0xAA
paddd xmm4, xmm6
pshufd xmm11, xmm1, 0xFF
paddd xmm3, xmm11
pshufd xmm1, xmm1, 0
paddd xmm7, xmm1

movntdqa xmm1, [rdx+16]
pshufd xmm2, xmm1, 0x55
paddd xmm8, xmm2
pshufd xmm6, xmm1, 0xAA
paddd xmm9, xmm6
pshufd xmm11, xmm1, 0xFF
paddd xmm10, xmm11
pshufd xmm1, xmm1, 0
paddd xmm0, xmm1

movdqa [hash], xmm7
movdqa [hash+16], xmm5
movdqa [hash+32], xmm4
movdqa [hash+48], xmm3
movdqa [hash+64], xmm0
movdqa [hash+80], xmm8
movdqa [hash+96], xmm9
movdqa [hash+112], xmm10

LAB_RET:
pop rbx
ret

I'll be attacking the LAB_LOOP next.
full member
Activity: 168
Merit: 100
Live long and prosper. \\//,
I did give atitweak a try, but it saw only 1 of 2 cards...
I don't want to play around with xorg.conf, and I don't have dummy plugs...

I have another PC with an HD 2600 XT, and one with an HD 4550. The first isn't compatible with the SDK, the second isn't worth mining with, and that PC also runs 7/24...

Dummy plugs aren't necessary with Linux. If atitweak doesn't see both cards (assuming they both support CL and work with another OS), try aticonfig. Re-seat the cards, try swapping slots, install one card on the board at a time, etc... it could be a hardware or power problem. I haven't used Windows outside of virtual sessions for years, so I don't know if it's still more tolerant (read: lax) about hardware issues than Linux.
The same setup works 100% with Win7 without any problem...
legendary
Activity: 1316
Merit: 1005
Also be aware that it may silently pretend to set the values but not actually do so. Although my 6970s report possible RAM speeds of 320-1450, if I set them to anything below 825, it actually just resets them to normal values.

I had the same issue with the 69xx series. There can't be more than a 125 MHz difference between core and memory clock speeds. It wasn't exactly elegant, but the solution that seems to work consistently can be found here:

http://forums.extremeoverclocking.com/showthread.php?t=355592

Basically, use a FreeDOS USB stick loaded with atiflash.exe to boot and extract the GPU BIOS. Then either use a Windows XP/Vista/7 system or a virtual session to run TechPowerUp's Radeon BIOS Editor. With it, you can set the memory clock speeds and save the updated BIOS. Reboot with the USB stick, flash the GPU BIOS and reboot a final time.

After that relatively painless process, I was able to use any method to underclock memory. It was well worth it, as my cards are running at the same Mh rate but ~7C cooler with memory underclocked and all other settings the same as before. I was even able to reliably raise my core speeds a bit past where they used to fail.


I did give atitweak a try, but it saw only 1 of 2 cards...
I don't want to play around with xorg.conf, and I don't have dummy plugs...

I have another PC with an HD 2600 XT, and one with an HD 4550. The first isn't compatible with the SDK, the second isn't worth mining with, and that PC also runs 7/24...

Dummy plugs aren't necessary with Linux. If atitweak doesn't see both cards (assuming they both support CL and work with another OS), try aticonfig. Re-seat the cards, try swapping slots, install one card on the board at a time, etc... it could be a hardware or power problem. I haven't used Windows outside of virtual sessions for years, so I don't know if it's still more tolerant (read: lax) about hardware issues than Linux.
legendary
Activity: 2450
Merit: 1002
Hello, I have to say I love CGMINER; it gets 4 Mh/s more on my 6950 than any config of GUIminer does. I'm using Windows.

I am trying to set it up on my other PC, which has no OpenCL-compatible device. GUIminer has no issue mining on the CPU, but when I start CGMINER it says no devices found, and no matter what CPU flags I throw at it, it refuses to run... any ideas?
It says "Error getting devices IDs (num)".
