BTCMiner - Open Source Bitcoin Miner for ZTEX FPGA Boards, 215 MH/s on LX150 - page 30.

rph

Agreed.. ArtForz's worst-case routing delays and build times are certainly much better than mine.
I have hit a wall around 156MHz. I brute-force-scanned xst/map/par options with multiple PCs,
and none of them make anywhere near a 2-3X improvement. It's an RTL/design issue;
he apparently has some tricks that we haven't discovered.

I am going to experiment with area/placement constraints next.

-rph

ztex

ArtForz and rph, motivated by your work I started a new attempt to implement a design with 2 stages per round. This time I paid more attention to the adders (rather than to overall utilization and speed, as before). I got it routed, but not faster than 160 MHz. But there are still several things to try out ...

I also analyzed the map/par reports from ArtForz. The design seems to be much easier to route. It is mapped/routed at least 2-3 times faster than all other designs I have seen (even the simple 1 stage per round designs). Or is there an optimizer option or constraint I missed?

The reason for this is not the arrangement of "W". If I omit it, I can't see any improvement in routability.

ArtForz

I'll give you a big fat hint: maybe having a nice and regular structure for the W updates isn't the best option...

rph

Sharing is caring, so here's the business end of my VHDL. I'm planning to try
a few alternative options for the adders...

Code:

one: if CYCLES = 1 generate
   -- combinational partial sums for this round
   t1 <= e1 + ch + i_t1;
   t2 <= e0 + maj;

   process(clk)
   begin
      if rising_edge(clk) then
         -- shift the W window down by one 32-bit word and append the new word
         o_data(447 downto   0) <= i_data(479 downto 32);
         o_data(479 downto 448) <= s1 + i_data(287 downto 256) + i_data14;

         o_state( 31 downto   0) <= t1 + t2;
         o_state( 63 downto  32) <= i_state( 31 downto   0);
         o_state( 95 downto  64) <= i_state( 63 downto  32);
         o_state(127 downto  96) <= i_state( 95 downto  64);
         o_state(159 downto 128) <= i_state(127 downto  96) + t1;
         o_state(191 downto 160) <= i_state(159 downto 128);
         o_state(223 downto 192) <= i_state(191 downto 160);

         -- terms handed to the next stage (state word 6 + W + K_NEXT)
         o_t1     <= i_state(223 downto 192) + i_data(31 downto 0) + K_NEXT;
         o_data14 <= s0 + i_data(31 downto 0);
      end if;
   end process;
end generate one;

two: if CYCLES = 2 generate
   process(clk)
   begin
      if rising_edge(clk) then
         -- first cycle
         t1 <= e1 + ch + i_t1;
         t2 <= e0 + maj;

         data(447 downto   0) <= i_data(479 downto 32);
         data(479 downto 448) <= s1 + i_data(287 downto 256) + i_data14;

         state <= i_state;

         t1_p   <= i_state(223 downto 192) + i_data(31 downto 0) + K_NEXT;
         data14 <= s0 + i_data(31 downto 0);

         -- second cycle
         o_data <= data;

         o_state( 31 downto   0) <= t1 + t2;
         o_state( 63 downto  32) <= state( 31 downto   0);
         o_state( 95 downto  64) <= state( 63 downto  32);
         o_state(127 downto  96) <= state( 95 downto  64);
         o_state(159 downto 128) <= state(127 downto  96) + t1;
         o_state(191 downto 160) <= state(159 downto 128);
         o_state(223 downto 192) <= state(191 downto 160);

         o_t1     <= t1_p;
         o_data14 <= data14;
      end if;
   end process;
end generate two;

-rph
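
For readers who do not have the rest of rph's source, here is a minimal Python sketch of the idea the stage above appears to rely on: since the next round's h is the current round's g, the h + W + K part of T1 can be computed one stage early, leaving only a 3-input add (e1 + ch + precomputed term) for T1 itself. This is a software model under that assumption, not rph's RTL. The names `compress_precomputed`, `pad_single_block`, `first_primes` and `icbrt` are made up for this illustration, and the constants are derived from their standard definitions rather than copied from the design. The result is checked against `hashlib`.

```python
import hashlib
from math import isqrt

MASK = 0xFFFFFFFF

def first_primes(n):
    """Return the first n primes by trial division (n is tiny here)."""
    found, cand = [], 2
    while len(found) < n:
        if all(cand % p for p in found):
            found.append(cand)
        cand += 1
    return found

def icbrt(n):
    """Floor of the integer cube root, by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = first_primes(64)
# Standard SHA-256 constants: fractional bits of square/cube roots of primes.
H0 = [isqrt(p << 64) & MASK for p in PRIMES[:8]]
K = [icbrt(p << 96) & MASK for p in PRIMES]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def schedule(block):
    """Expand one 64-byte block into the 64 message-schedule words W."""
    w = [int.from_bytes(block[4 * i:4 * i + 4], "big") for i in range(16)]
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    return w

def compress_precomputed(block):
    """SHA-256 compression with h + W + K computed one round ahead."""
    w = schedule(block)
    a, b, c, d, e, f, g, h = H0
    pre = (h + w[0] + K[0]) & MASK            # term for round 0
    for t in range(64):
        e1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        e0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t1 = (e1 + ch + pre) & MASK           # 3-input add only
        t2 = (e0 + maj) & MASK
        if t < 63:
            # next round's h is this round's g, so its term is known already
            pre = (g + w[t + 1] + K[t + 1]) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    state = (a, b, c, d, e, f, g, h)
    return b"".join(((x + y) & MASK).to_bytes(4, "big") for x, y in zip(H0, state))

def pad_single_block(msg):
    """SHA-256 padding for messages short enough to fit one block."""
    assert len(msg) < 56
    return msg + b"\x80" + b"\x00" * (55 - len(msg)) + (8 * len(msg)).to_bytes(8, "big")

print(compress_precomputed(pad_single_block(b"abc")) == hashlib.sha256(b"abc").digest())
```

The variable names e1, ch, e0, maj, t1 and t2 follow the VHDL above; everything else is illustrative.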

rph

ArtForz, you'd have at least 3 friends for life if you posted the RTL. :D

With 2 clocks per stage, there are no >3 input adders, and xst seems to handle those
reasonably well. synth is fine; I'm currently battling the mapper, and its craptastic 4ns routes.
Either I have some long-distance routing requirement that ArtForz somehow eliminated,
or the tools are just being dumb and need some area constraint love.

-rph

pusle

This is a 5 input carry save adder I made in an attempt to fit two full chains in an LX150.
I don't know if this helps you guys out, but I don't have time to test it myself atm.

download it here: http://www.omegav.ntnu.no/~kamben/adder5x.vhd

or copy paste this:

-- This block uses 94 LUTs with only 29 Carry chain LUTs. (sliceM/L) (implemented purely combinatorial, no regs )
-- XST synth of 4 or 5 input adder uses 64 LUTs with 64 carry chain LUTs. (sliceM/L) (implemented purely combinatorial, no regs )

LIBRARY IEEE;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_unsigned."+";

-- UNISIM provides the component declaration for the LUT6_2 primitive instantiated below
Library UNISIM;
use UNISIM.vcomponents.all;

ENTITY adder5x IS PORT (
reset : IN std_logic;
clk : IN std_logic;

ina : IN std_logic_vector(31 downto 0);
inb : IN std_logic_vector(31 downto 0);
inc : IN std_logic_vector(31 downto 0);
ind : IN std_logic_vector(31 downto 0);
ine : IN std_logic_vector(31 downto 0);

qout : OUT std_logic_vector(31 downto 0));
END adder5x;

ARCHITECTURE rtl OF adder5x IS

SIGNAL a: std_logic_vector(31 downto 0);
SIGNAL b: std_logic_vector(31 downto 0);
SIGNAL c: std_logic_vector(31 downto 0);
SIGNAL d: std_logic_vector(31 downto 0);
SIGNAL e: std_logic_vector(31 downto 0);
SIGNAL qr: std_logic_vector(31 downto 0);
--

SIGNAL SA,SAr :std_logic_vector(31 downto 0);
SIGNAL SB,SBr :std_logic_vector(31 downto 2);
SIGNAL S1,S2,S3 :std_logic_vector(31 downto 0);

--SIGNAL fasit : std_logic_vector(31 downto 0);

BEGIN

-- input_reg: PROCESS (reset, clk)
--BEGIN
-- IF (clk'event AND clk='1') THEN
a<=ina;
b<=inb;
c<=inc;
d<=ind;
e<=ine;
-- END IF;
--END PROCESS;

-- pipe_reg: PROCESS (reset, clk)
--BEGIN
-- IF (clk'event AND clk='1') THEN
SAr<=SA; -- if your whole "chain" only has 1 pipeline register
SBr<=SB; -- this might be a good place to put it
-- END IF;
--END PROCESS;

-- output_reg: PROCESS (reset, clk)
--BEGIN
-- IF (clk'event AND clk='1') THEN
qr<=SAr+(SBr & "00"); -- Regular carry chain adder for the last stage
-- END IF;
--END PROCESS;

qout<=qr;

--fasit<=a+b+c+d+e;

------------
--calc

-- first LUT column of adder
-- 5 single bit inputs -> 3 bit sum output
LUT_stage1:FOR i IN 0 TO 31 GENERATE

---------
S1(i)<=a(i) XOR b(i) XOR c(i) XOR d(i) XOR e(i);

-----
-- forced LUT alternative. slightly faster, uses more overall LUTs
-- could save 1 sliceM/L for every 2 adder blocks. Might make routing easier.

--LUT5_inst1a : LUT5
--generic map (
--INIT => x"96696996")
--port map (
--O => S1(i),
--I0 => a(i),
--I1 => b(i),
--I2 => c(i),
--I3 => d(i),
--I4 => e(i));
-----
---------

LUT_inst1bc : LUT6_2
generic map (
INIT => x"E8808000177E7EE8")
port map (
O6 => S3(i),
O5 => S2(i),
I0 => a(i),
I1 => b(i),
I2 => c(i),
I3 => d(i),
I4 => e(i),
I5 => '1');

END GENERATE;

-- 2x3bit LUT sums -> 2+2bit output sum
-- max sum = 5+(2*5)=15, range 0-15 -> exact 4 bit
LUT_stage2A:FOR i IN 0 TO 15 GENERATE

SA((i*2))<=S1((i*2));
SA((i*2)+1)<=S2((i*2)) XOR S1((i*2)+1);

END GENERATE;

--SB(0)<='0';
--SB(1)<='0';

LUT_stage2B:FOR i IN 0 TO 14 GENERATE

LUT_inst2cd : LUT6_2
generic map (
INIT => x"0077640000641364")
port map (
O6 => SB((i*2)+3),
O5 => SB((i*2)+2),
I0 => S2((i*2)), -- B1
I1 => S3((i*2)), -- C1
I2 => S1((i*2)+1), -- A2
I3 => S2((i*2)+1), -- B2
I4 => S3((i*2)+1), -- C2
I5 => '1'); --

END GENERATE;

END rtl;
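
Since pusle notes the block is untested, here is a minimal Python bit-level model of the adder5x logic above, useful for checking the two INIT strings before spending a build on them. It assumes the usual LUT6_2 behaviour with I5 tied to '1': O5 is read from INIT[31:0] and O6 from INIT[63:32], both indexed by I4..I0 with I0 as the least significant bit. The function names `lut6_2`, `bit` and `adder5x` are chosen for this sketch and are not part of the posted file.

```python
MASK32 = 0xFFFFFFFF
INIT_STAGE1 = 0xE8808000177E7EE8   # from LUT_inst1bc
INIT_STAGE2 = 0x0077640000641364   # from LUT_inst2cd

def lut6_2(init, i0, i1, i2, i3, i4):
    """Model of LUT6_2 with I5 = '1'; returns (O6, O5)."""
    idx = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3) | (i4 << 4)
    o5 = (init >> idx) & 1
    o6 = (init >> (32 + idx)) & 1
    return o6, o5

def bit(x, i):
    return (x >> i) & 1

def adder5x(a, b, c, d, e):
    """Bit-level model of the posted 5-input carry-save adder (mod 2**32)."""
    # Stage 1: per bit column, 5 inputs -> 3-bit count (S3 S2 S1).
    s1 = a ^ b ^ c ^ d ^ e
    s2 = s3 = 0
    for i in range(32):
        o6, o5 = lut6_2(INIT_STAGE1, bit(a, i), bit(b, i), bit(c, i), bit(d, i), bit(e, i))
        s3 |= o6 << i
        s2 |= o5 << i

    # Stage 2A: low two bits of each column pair go to SA.
    sa = 0
    for i in range(16):
        sa |= bit(s1, 2 * i) << (2 * i)
        sa |= (bit(s2, 2 * i) ^ bit(s1, 2 * i + 1)) << (2 * i + 1)

    # Stage 2B: high two bits of each pair go to SB, two positions up.
    sb = 0
    for i in range(15):
        o6, o5 = lut6_2(INIT_STAGE2,
                        bit(s2, 2 * i), bit(s3, 2 * i),
                        bit(s1, 2 * i + 1), bit(s2, 2 * i + 1), bit(s3, 2 * i + 1))
        sb |= o6 << (2 * i + 3)
        sb |= o5 << (2 * i + 2)

    # Final stage: one ordinary carry-chain add, as in qr <= SAr + (SBr & "00").
    return (sa + sb) & MASK32

vals = (0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A, 0x510E527F)
print(adder5x(*vals) == sum(vals) & MASK32)
```

In this model the result agrees with a plain 32-bit sum of the five inputs, which suggests the INIT values implement the 5:3 and pair-combining tables described in the comments. It says nothing about timing or about how XST maps the design.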

ArtForz

Yep, and that's with 200ps clock jitter; your assumed 0-jitter clock would knock it down to a 5.091ns cycle time => a bit over 196MHz :D

On my real world rev1.1 boards with -2 speed grade, 1230-1250mV Vccint and 25°C ambient, this bitstream averages 193.9 MHz at very low error rate (0 errors over 2**35 hashes).
Pushing up error rate to 0.1% => 198.3MHz average.

rph

Interesting. 5.172ns == 193MHz!

Thanks for the data.

-rph
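
The period and frequency figures traded in these posts are reciprocals of each other. A small Python sketch of the conversion, using only numbers quoted in the thread (the helper names are made up for this example):

```python
def period_ns_to_mhz(period_ns):
    """Clock frequency in MHz for a clock period given in nanoseconds."""
    return 1000.0 / period_ns

def mhz_to_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz

cases = [
    ("best achievable period with 200 ps input jitter", 5.172),
    ("estimated period with a zero-jitter clock", 5.091),
]
for label, period in cases:
    print(f"{label}: {period} ns -> {period_ns_to_mhz(period):.1f} MHz")

# The measured 193.9 MHz average corresponds to this cycle time:
print(f"193.9 MHz -> {mhz_to_period_ns(193.9):.3f} ns")
```

So 5.172 ns works out to roughly 193.3 MHz and 5.091 ns to roughly 196.4 MHz, in line with the "193MHz" and "a bit over 196MHz" quoted above.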

ArtForz

And here is the exact same HDL/settings, but using ISE 13.2 and tightening timing a bit:

Total REAL time to MAP completion: 30 mins 17 secs
Total CPU time to MAP completion: 28 mins 23 secs

Slice Logic Utilization:
  Number of Slice Registers: 92,968 out of 184,304 50%
    Number used as Flip Flops: 92,823
    Number used as Latches: 0
    Number used as Latch-thrus: 0
    Number used as AND/OR logics: 145
  Number of Slice LUTs: 60,406 out of 92,152 65%
    Number used as logic: 34,257 out of 92,152 37%
      Number using O6 output only: 21,087
      Number using O5 output only: 409
      Number using O5 and O6: 12,761
      Number used as ROM: 0
    Number used as Memory: 2,721 out of 21,680 12%
      Number used as Dual Port RAM: 0
      Number used as Single Port RAM: 0
      Number used as Shift Register: 2,721
        Number using O6 output only: 450
        Number using O5 output only: 0
        Number using O5 and O6: 2,271
    Number used exclusively as route-thrus: 23,428
      Number with same-slice register load: 23,414
      Number with same-slice carry load: 14
      Number with other load: 0

Slice Logic Distribution:
  Number of occupied Slices: 15,446 out of 23,038 67%
  Number of LUT Flip Flop pairs used: 60,460
    Number with an unused Flip Flop: 868 out of 60,460 1%
    Number with an unused LUT: 54 out of 60,460 1%
    Number of fully used LUT-FF pairs: 59,538 out of 60,460 98%
    Number of slice register sites lost
      to control set restrictions: 0 out of 184,304 0%

Total REAL time to Router completion: 25 mins 23 secs
Total CPU time to Router completion: 24 mins 26 secs

----------------------------------------------------------------------------------------------------------
Constraint                               | Check | Worst Case | Best Case  | Timing | Timing
                                         |       | Slack      | Achievable | Errors | Score
----------------------------------------------------------------------------------------------------------
TS_coreclk = PERIOD TIMEGRP "tncoreclk"  | SETUP | 0.233ns    | 5.172ns    | 0      | 0
  185 MHz HIGH 50% INPUT_JITTER 0.2 ns   | HOLD  | 0.316ns    |            | 0      | 0

ArtForz

As a little encouragement, here's a decent run for my old design (ISE synth+map+p&r, letting synth infer shift regs, no placement constraints, ...) for -3 speed grade

Device Utilization Summary:

Slice Logic Utilization:
  Number of Slice Registers: 92,964 out of 184,304 50%
    Number used as Flip Flops: 92,819
    Number used as Latches: 0
    Number used as Latch-thrus: 0
    Number used as AND/OR logics: 145
  Number of Slice LUTs: 62,141 out of 92,152 67%
    Number used as logic: 34,288 out of 92,152 37%
      Number using O6 output only: 21,087
      Number using O5 output only: 424
      Number using O5 and O6: 12,777
      Number used as ROM: 0
    Number used as Memory: 2,721 out of 21,680 12%
      Number used as Dual Port RAM: 0
      Number used as Single Port RAM: 0
      Number used as Shift Register: 2,721
        Number using O6 output only: 450
        Number using O5 output only: 0
        Number using O5 and O6: 2,271
    Number used exclusively as route-thrus: 25,132
      Number with same-slice register load: 25,117
      Number with same-slice carry load: 15
      Number with other load: 0

Slice Logic Distribution:
  Number of occupied Slices: 16,519 out of 23,038 71%
  Number of LUT Flip Flop pairs used: 62,163
    Number with an unused Flip Flop: 2,573 out of 62,163 4%
    Number with an unused LUT: 22 out of 62,163 1%
    Number of fully used LUT-FF pairs: 59,568 out of 62,163 95%
    Number of slice register sites lost
      to control set restrictions: 0 out of 184,304 0%

...
Total REAL time to PAR completion: 19 mins 45 secs
Total CPU time to PAR completion: 20 mins 34 secs

Timing:
================================================================================
Timing constraint: TS_coreclk = PERIOD TIMEGRP "tncoreclk" 182 MHz HIGH 50% INPUT_JITTER 0.2 ns;
3102658 paths analyzed, 386321 endpoints analyzed, 0 failing endpoints
0 timing errors detected. (0 setup errors, 0 hold errors, 0 component switching limit errors)
Minimum period is 5.262ns.
--------------------------------------------------------------------------------

Paths for end point XLXI_A/rb20/regt1_31 (SLICE_X104Y33.CIN), 252 paths
--------------------------------------------------------------------------------
Slack (setup path): 0.232ns (requirement - (data path - clock path skew + uncertainty))
Source: XLXI_A/rb19/outE_17 (FF)
Destination: XLXI_A/rb20/regt1_31 (FF)
Requirement: 5.494ns
Data Path Delay: 4.928ns (Levels of Logic = 8)

Clock Path Skew: -0.111ns (0.620 - 0.731)
Source Clock: coreclk rising at 0.000ns
Destination Clock: coreclk rising at 5.494ns
Clock Uncertainty: 0.223ns

Clock Uncertainty: 0.223ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
Total System Jitter (TSJ): 0.070ns
Total Input Jitter (TIJ): 0.200ns
Discrete Jitter (DJ): 0.233ns
Phase Error (PE): 0.000ns

Maximum Data Path at Slow Process Corner: XLXI_A/rb19/outE_17 to XLXI_A/rb20/regt1_31
Location Delay type Delay(ns) Physical Resource
Logical Resource(s)
------------------------------------------------- -------------------
SLICE_X126Y28.BQ Tcko 0.408 XLXI_A/rb19/outE<19>
XLXI_A/rb19/outE_17
SLICE_X115Y27.A4 net (fanout=8) 1.658 XLXI_A/rb19/outE<17>
SLICE_X115Y27.A Tilo 0.259 XLXI_A/rb20/s1<6>
XLXI_A/rb20/s1<6>1
SLICE_X104Y27.CX net (fanout=1) 1.798 XLXI_A/rb20/s1<6>
SLICE_X104Y27.COUT Tcxcy 0.093 XLXI_A/rb20/regt1<7>
XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<7>
SLICE_X104Y28.CIN net (fanout=1) 0.003 XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<7>
SLICE_X104Y28.COUT Tbyp 0.076 XLXI_A/rb20/regt1<11>
XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<11>
SLICE_X104Y29.CIN net (fanout=1) 0.003 XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<11>
SLICE_X104Y29.COUT Tbyp 0.076 XLXI_A/rb20/regt1<15>
XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<15>
SLICE_X104Y30.CIN net (fanout=1) 0.003 XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<15>
SLICE_X104Y30.COUT Tbyp 0.076 XLXI_A/rb20/regt1<19>
XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<19>
SLICE_X104Y31.CIN net (fanout=1) 0.003 XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<19>
SLICE_X104Y31.COUT Tbyp 0.076 XLXI_A/rb20/regt1<23>
XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<23>
SLICE_X104Y32.CIN net (fanout=1) 0.003 XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<23>
SLICE_X104Y32.COUT Tbyp 0.076 XLXI_A/rb20/regt1<27>
XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<27>
SLICE_X104Y33.CIN net (fanout=1) 0.003 XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<27>
SLICE_X104Y33.CLK Tcinck 0.314 XLXI_A/rb20/regt1<31>
XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_xor<31>
XLXI_A/rb20/regt1_31
------------------------------------------------- ---------------------------
Total 4.928ns (1.454ns logic, 3.474ns route)
(29.5% logic, 70.5% route)
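
The slack and clock-uncertainty lines in the report above follow the two formulas printed in the report itself. A short Python sketch that plugs in the reported numbers (the function names are invented for this example; the formulas are the ones shown in the report text):

```python
from math import sqrt

def clock_uncertainty(tsj, tij, dj, pe=0.0):
    """((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE, all in ns."""
    return (sqrt(tsj ** 2 + tij ** 2) + dj) / 2 + pe

def setup_slack(requirement, data_path, clock_skew, uncertainty):
    """requirement - (data path - clock path skew + uncertainty), all in ns."""
    return requirement - (data_path - clock_skew + uncertainty)

# Values copied from the report above.
unc = clock_uncertainty(tsj=0.070, tij=0.200, dj=0.233, pe=0.000)
slack = setup_slack(requirement=5.494, data_path=4.928,
                    clock_skew=-0.111, uncertainty=0.223)

print(f"clock uncertainty ~ {unc:.4f} ns (report: 0.223 ns)")
print(f"setup slack       ~ {slack:.3f} ns (report: 0.232 ns)")
print(f"minimum period    ~ {5.494 - slack:.3f} ns (report: 5.262 ns)")
```

The computed uncertainty lands within a picosecond of the reported 0.223 ns; how ISE rounds that figure is not something this sketch tries to reproduce. Note that the 0.2 ns input jitter from the constraint accounts for most of the uncertainty term.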

rph

Build summary as ztex requested. Hopefully substantiates that 150MH/s+
is possible in this device, non-OC'd, with some room to spare! :D

Slice Logic Utilization:
  Number of Slice Registers: 96,777 out of 184,304 52%
    Number used as Flip Flops: 96,777
    Number used as Latches: 0
    Number used as Latch-thrus: 0
    Number used as AND/OR logics: 0
  Number of Slice LUTs: 58,692 out of 92,152 63%
    Number used as logic: 39,716 out of 92,152 43%
      Number using O6 output only: 29,405
      Number using O5 output only: 424
      Number using O5 and O6: 9,887
      Number used as ROM: 0
    Number used as Memory: 3,056 out of 21,680 14%
      Number used as Dual Port RAM: 0
      Number used as Single Port RAM: 0
      Number used as Shift Register: 3,056
        Number using O6 output only: 0
        Number using O5 output only: 0
        Number using O5 and O6: 3,056
    Number used exclusively as route-thrus: 15,920
      Number with same-slice register load: 15,858
      Number with same-slice carry load: 62
      Number with other load: 0
---
...
Phase 12 : 0 unrouted; (Setup:0, Hold:0, Component Switching Limit:0) REAL time: 1 hrs 30 mins 48 secs
Total REAL time to Router completion: 1 hrs 30 mins 48 secs
Total CPU time to Router completion: 1 hrs 31 mins 27 secs
---
Critical path:
Slack: 0.121ns (requirement - (data path - clock path skew + uncertainty))
  Source: engine/sha2/pipe/stagegen[5].stageX/o_state_32 (FF)
  Destination: engine/sha2/pipe/stagegen[6].stageX/t2_30 (FF)
  Requirement: 6.666ns
  Data Path Delay: 6.382ns (Levels of Logic = 8)

  Clock Path Skew: -0.021ns (0.239 - 0.260)
  Source Clock: clk rising at 0.000ns
  Destination Clock: clk rising at 6.666ns
  Clock Uncertainty: 0.142ns

  Clock Uncertainty: 0.142ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
   Total System Jitter (TSJ): 0.070ns
   Total Input Jitter (TIJ): 0.000ns
   Discrete Jitter (DJ): 0.213ns
   Phase Error (PE): 0.000ns

  Maximum Data Path at Slow Process Corner: engine/sha2/pipe/stagegen[5].stageX/o_state_32 to engine/sha2/pipe/stagegen[6].stageX/t2_30
   Location Delay type Delay(ns) Physical Resource
   Logical Resource(s)
   ------------------------------------------------- -------------------
   SLICE_X49Y175.AMUX Tshcko 0.461 engine/sha2/pipe/stagegen[5].stageX/state<3>
   engine/sha2/pipe/stagegen[5].stageX/o_state_32
   SLICE_X32Y164.A5 net (fanout=2) 4.629 engine/sha2/pipe/stagegen[5].stageX/o_state<32>
   SLICE_X32Y164.COUT Topcya 0.395 engine/sha2/pipe/stagegen[6].stageX/t2<3>
   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_lut<0>
   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<3>
   SLICE_X32Y165.CIN net (fanout=1) 0.003 engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<3>
   SLICE_X32Y165.COUT Tbyp 0.076 engine/sha2/pipe/stagegen[6].stageX/t2<7>
   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<7>
   SLICE_X32Y166.CIN net (fanout=1) 0.003 engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<7>
   SLICE_X32Y166.COUT Tbyp 0.076 engine/sha2/pipe/stagegen[6].stageX/t2<11>
   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<11>
   SLICE_X32Y167.CIN net (fanout=1) 0.003 engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<11>
   SLICE_X32Y167.COUT Tbyp 0.076 engine/sha2/pipe/stagegen[6].stageX/t2<15>
   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<15>
   SLICE_X32Y168.CIN net (fanout=1) 0.082 engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<15>
   SLICE_X32Y168.COUT Tbyp 0.076 engine/sha2/pipe/stagegen[6].stageX/t2<19>
   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<19>
   SLICE_X32Y169.CIN net (fanout=1) 0.003 engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<19>
   SLICE_X32Y169.COUT Tbyp 0.076 engine/sha2/pipe/stagegen[6].stageX/t2<23>
   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<23>
   SLICE_X32Y170.CIN net (fanout=1) 0.003 engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<23>
   SLICE_X32Y170.COUT Tbyp 0.076 engine/sha2/pipe/stagegen[6].stageX/t2<27>
   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<27>
   SLICE_X32Y171.CIN net (fanout=1) 0.003 engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<27>
   SLICE_X32Y171.CLK Tcinck 0.341 engine/sha2/pipe/stagegen[6].stageX/t2<31>
   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_xor<31>
   engine/sha2/pipe/stagegen[6].stageX/t2_30
   ------------------------------------------------- ---------------------------
   Total 6.382ns (1.653ns logic, 4.729ns route)
   (25.9% logic, 74.1% route)
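
To see where the time goes on this critical path, the individual delays listed above can be summed by type. A minimal Python sketch, with the values copied from the path report (the variable names are made up here):

```python
# Logic delays in ns: Tshcko, Topcya, six Tbyp carry hops, Tcinck.
logic_ns = [0.461, 0.395] + [0.076] * 6 + [0.341]

# Net (routing) delays in ns, in the order they appear in the report.
route_ns = [4.629, 0.003, 0.003, 0.003, 0.082, 0.003, 0.003, 0.003]

logic_total = sum(logic_ns)
route_total = sum(route_ns)
path_total = logic_total + route_total

print(f"logic {logic_total:.3f} ns, route {route_total:.3f} ns, total {path_total:.3f} ns")
print(f"route share:        {100 * route_total / path_total:.1f}%")
print(f"largest single net: {100 * max(route_ns) / path_total:.1f}% of the path")
```

The totals reproduce the report's 1.653 ns logic and 4.729 ns route. One net, the 4.629 ns hop from SLICE_X49Y175 to SLICE_X32Y164, makes up nearly three quarters of the whole path, which matches rph's complaint about 4 ns routes elsewhere in the thread.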

rph

> ArtForz and rph, the theoretical speed limit according to xst for a 1 stage / sha256 round pipeline is 156.490 MHz (sounds familiar).
> This is defined by the levels of logic. In practice about 130-135 MHz can be achieved.

Agreed. The xst Fmax is unrealistically high b/c it does not really consider routing delays.

I was using a two cycle per stage (3-input adder) design like ArtForz described, to reach 156MHz in -3. It took about 3 hours to map.
I had to tune the build options; with some settings map would just freeze for 1+ day during global placement. Very annoying.
Will post more once I build the HW (hopefully this weekend), and ensure the design works and produces valid hashes
and can actually be powered/cooled.

-rph

ztex

rph, please share your code with us or at least the report files (.syr, .map, .mrp, .par) so that I can believe your results.

ArtForz and rph, the theoretical speed limit according to xst for a 1 stage / sha256 round pipeline is 156.490 MHz (sounds familiar). This is defined by the levels of logic. In practice about 130-135 MHz can be achieved.

The theoretical frequency limit of a 2 stage / sha256 round pipeline reported by xst is up to 284.649 MHz, depending on the number of additional registers. But as written before, I was not able to route such a design. I even tried to use DSP slices, which dropped the theoretical max. frequency to about 80 MHz.

rph

ArtForz, I've tried exactly that, and reached 156MHz in -3 (probably ~180MH/s OC'd).
I guess there is room to optimize a bit more.

The DSP48s are useful if you design around them. I've reached 500MHz+ with a rolled SHA256
core on V6 w/o much effort. The DSPs have dedicated routing and dedicated registers, so
there's less room for the SW tools to screw things up.

-rph

ArtForz

While 2 pipelines could theoretically fit on a S6-LX150, routing is impossible (S6s have a lot less "long" routing resources than virtexes).
What *is* possible is one pipeline with 2 register stages per sha256 round, use SRL16s for W where possible, don't go overboard with em and don't let synthesis infer more shift regs.
And don't expect to get > 170MHz or so post-p&r on -2 without giving the placer some help.

*edit: SRL16, not LSR16
*edit2: the xilinx docs aren't very clear on this, but a LUT6 in a SLICEM can be used as *2* equal-length SRL16s => a 32-bit wide shift reg up to 17 deep (slice FFs give the last stage) is only 4 SLICEMs.
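
The "only 4 SLICEMs" figure in edit2 can be checked with a few lines of arithmetic. A minimal Python sketch, assuming 4 LUTs per Spartan-6 slice and, as stated above, 2 equal-length SRL16s per LUT6 (the function name and parameters are invented for this example):

```python
def slicems_for_shift_reg(width_bits, srls_per_lut=2, luts_per_slice=4):
    """Number of SLICEMs needed for a shift register `width_bits` wide."""
    luts = -(-width_bits // srls_per_lut)      # ceiling division
    return -(-luts // luts_per_slice)

SRL_DEPTH = 16        # taps per SRL16
FF_STAGES = 1         # the slice flip-flop supplies the last stage

print("SLICEMs, 2 SRL16 per LUT:", slicems_for_shift_reg(32))
print("SLICEMs, 1 SRL16 per LUT:", slicems_for_shift_reg(32, srls_per_lut=1))
print("max depth:", SRL_DEPTH + FF_STAGES)
```

With two SRL16s per LUT a 32-bit-wide register needs 16 LUTs, hence 4 SLICEMs; with one per LUT it would need 8. The depth of 17 is the 16 SRL taps plus the flip-flop stage.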

ztex

Quote from: rph on September 08, 2011, 01:01:09 AM

> Well, my theory is that two pipelines could (barely) fit into a 6s150-sized part, if it had more carry chains
> plus DSP48E1s. With the SLICEX and only 180 DSP48A1s, there's just no way.

Did you ever try to validate your theory? Your theory does not consider some facts:

- Only about 1/3 of the slices required by the current design need to have a carry chain. As I wrote in my last post, there is no shortage of carry chains.
- It's not sufficient just to place the LUTs; they also have to be connected in some way, i.e. they have to be routed. The routing resources are the limiting factor.
- DSP slices cannot be used for adder trees, are slower than LUTs, and are more difficult to route.

Quote

> And.. I think DSP48, esp DSP48E1 in V6/A7/K7, can be very useful for SHA256.
> I hope to validate that claim soon on real HW. :D

I tried it on S6. The number of required LUTs was reduced by less than 10% and, AFAIR, the clock dropped to 70-80 MHz.

rph

Well, my theory is that two pipelines could (barely) fit into a 6s150-sized part, if it had more carry chains
plus DSP48E1s. With the SLICEX and only 180 DSP48A1s, there's just no way.

RE Pricing: There will always be a premium for the Virtex. But much smaller than one might suspect
based on non-negotiated list prices on the web. V6 is a 40nm part; S6 is 45nm. And since V6/K7 doesn't
have the SLICEX limitations, and offers larger devices (meaning fewer boards to assemble)
I think it's worth it for a big operation.

And.. I think DSP48, esp DSP48E1 in V6/A7/K7, can be very useful for SHA256.
I hope to validate that claim soon on real HW. :D

-rph

ztex

Quote from: rph on September 07, 2011, 01:01:40 AM

> In high vol (>1k) the mid-range virtex6 parts become close to 6s150 in price per LUT.
> High vol pricing is basically set by the die size.

I never asked my distributor for 1k prices of Spartans or Virtexes, but that would surprise me for several reasons:

- Virtex FPGAs are faster. Xilinx would not sell them for the same price/LUT.
- Virtex FPGAs contain a huge number of large DSP slices, which are mainly used for multiplication and are almost useless for bitcoin mining. These DSP slices are the backbone of Virtex FPGAs (used for HPC), and Xilinx would not give them away for free.
- Spartan FPGAs are simpler (SLICEX).

Quote

> I would not use spartan6 in a large scale mining operation, as the SLICEX
> makes about half the LUTs (thus, half the die) useless. Virtex6 or Kintex7, with a carry
> chain in every LUT

There is no shortage of carry chains. About 60% of the used slices are SLICEXs, so I wouldn't agree that they can't be used ;)

rph

In high vol (>1k) the mid-range virtex6 parts become close to 6s150 in price per LUT.
High vol pricing is basically set by the die size.

I would not use spartan6 in a large scale mining operation, as the SLICEX
makes about half the LUTs (thus, half the die) useless. Virtex6 or Kintex7, with a carry
chain in every LUT, should have higher MH/$ for the folks willing to invest real money
into FPGA mining. And then once you're in virtex, if your wallet allows, you can consider
Easypath..

-rph

Uhlbelk

Thanks for all the valuable info. You folks are awesome.
