
Topic: BTCMiner - Open Source Bitcoin Miner for ZTEX FPGA Boards, 215 MH/s on LX150 - page 30. (Read 161816 times)

rph
full member
Activity: 176
Merit: 100
Agreed.. ArtForz's worst-case routing delays and build times are certainly much better than mine.
I have hit a wall around 156MHz. I brute-force-scanned xst/map/par options with multiple PCs,
and none of them make anywhere near a 2-3X improvement. It's an RTL/design issue;
he apparently has some tricks that we haven't discovered.

I am going to experiment with area/placement constraints next.

-rph
donator
Activity: 367
Merit: 250
ZTEX FPGA Boards
ArtForz and rph, motivated by your work I started a new attempt to implement a design with 2 stages per round. This time I took more care with the adders (rather than overall utilization and speed, as before). I got it routed, but not faster than 160 MHz. But there are still several things to try out ...

I also analyzed the map/par reports from ArtForz. The design seems to be much easier to route. It is mapped/routed at least 2-3 times faster than all other designs I have seen (even the simple 1 stage per round designs). Or is there an optimizer option or constraint I missed?

The reason for this is not the arrangement of "W". If I omit it I can't see any improvement in routability.
sr. member
Activity: 406
Merit: 257
I'll give you a big fat hint: maybe having a nice and regular structure for the W updates isn't the best option...
rph
full member
Activity: 176
Merit: 100
Sharing is caring, so here's the business end of my VHDL. I'm planning to try
a few alternative options for the adders...

Code:
   one: if CYCLES = 1 generate
        t1     <= e1 + ch + i_t1;
        t2     <= e0 + maj;

        process(clk)
        begin
            if rising_edge(clk) then
                o_data(447 downto   0)   <= i_data(479 downto 32);
                o_data(479 downto 448)   <= s1 + i_data(287 downto 256) + i_data14;

                o_state( 31 downto   0)  <= t1 + t2;
                o_state( 63 downto  32)  <= i_state( 31 downto   0);
                o_state( 95 downto  64)  <= i_state( 63 downto  32);
                o_state(127 downto  96)  <= i_state( 95 downto  64);
                o_state(159 downto 128)  <= i_state(127 downto  96) + t1;
                o_state(191 downto 160)  <= i_state(159 downto 128);
                o_state(223 downto 192)  <= i_state(191 downto 160);


                o_t1     <= i_state(223 downto 192) + i_data(31 downto 0) + K_NEXT;
                o_data14 <= s0 + i_data(31 downto 0);
            end if;
        end process;
    end generate one;

    two: if CYCLES = 2 generate
        process(clk)
        begin
            if rising_edge(clk) then
                -- first cycle
                t1     <= e1 + ch + i_t1;
                t2     <= e0 + maj;

                data(447 downto   0)   <= i_data(479 downto 32);
                data(479 downto 448)   <= s1 + i_data(287 downto 256) + i_data14;

                state <= i_state;

                t1_p   <= i_state(223 downto 192) + i_data(31 downto 0) + K_NEXT;
                data14 <= s0 + i_data(31 downto 0);

                -- second cycle
                o_data <= data;

                o_state( 31 downto   0)  <= t1 + t2;
                o_state( 63 downto  32)  <= state( 31 downto   0);
                o_state( 95 downto  64)  <= state( 63 downto  32);
                o_state(127 downto  96)  <= state( 95 downto  64);
                o_state(159 downto 128)  <= state(127 downto  96) + t1;
                o_state(191 downto 160)  <= state(159 downto 128);
                o_state(223 downto 192)  <= state(191 downto 160);

                o_t1 <= t1_p;
                o_data14 <= data14;
            end if;
        end process;
    end generate two;

-rph
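
For anyone following the `o_t1` line in the code above: it appears to add the word that will become h in the next round (`i_state(223 downto 192)`, i.e. g) to the next message word and round constant one stage early, so that t1 only needs a 3-input add. Below is a minimal Python sketch of that idea, not rph's code. The function names are made up for illustration, the constants are derived from the primes rather than typed in, and the result is checked against `hashlib` for a single-block message.

```python
import hashlib
from math import isqrt

MASK = 0xFFFFFFFF

def icbrt(n):
    """Floor of the integer cube root, by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

def first_primes(count):
    found, n = [], 2
    while len(found) < count:
        if all(n % p for p in found):
            found.append(n)
        n += 1
    return found

PRIMES = first_primes(64)
K = [icbrt(p << 96) & MASK for p in PRIMES]       # fractional bits of cube roots
H0 = [isqrt(p << 64) & MASK for p in PRIMES[:8]]  # fractional bits of square roots

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def schedule(block):
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    return w

def compress_precomputed(h_in, block):
    """SHA-256 compression where h + K + W is prepared one round ahead."""
    w = schedule(block)
    a, b, c, d, e, f, g, h = h_in
    pre = (h + K[0] + w[0]) & MASK            # prepared before round 0
    for t in range(64):
        e1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        e0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t1 = (e1 + ch + pre) & MASK           # 3-input add, h is never read here
        t2 = (e0 + maj) & MASK
        if t < 63:
            pre = (g + K[t + 1] + w[t + 1]) & MASK   # g becomes h next round
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(x + y) & MASK for x, y in zip(h_in, (a, b, c, d, e, f, g, h))]

def sha256_one_block(msg):
    """Pad a short message (at most 55 bytes) to one block and hash it."""
    assert len(msg) <= 55
    block = msg + b"\x80" + bytes(55 - len(msg)) + (8 * len(msg)).to_bytes(8, "big")
    return b"".join(x.to_bytes(4, "big") for x in compress_precomputed(H0, block))
```

The only state a round reads is a..g plus `pre`, which matches the 224-bit `o_state` and the separate `o_t1` register in the VHDL.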
rph
full member
Activity: 176
Merit: 100
ArtForz, you'd have at least 3 friends for life if you posted the RTL.  ;D

With 2 clocks per stage, there are no >3 input adders, and xst seems to handle those
reasonably well. synth is fine; I'm currently battling the mapper, and its craptastic 4ns routes.
Either I have some long-distance routing requirement that ArtForz somehow eliminated,
or the tools are just being dumb and need some area constraint love.

-rph
member
Activity: 89
Merit: 10

This is a 5 input carry save adder I made in an attempt to fit two full chains in an LX150.
I don't know if this helps you guys out but I don't have time to test it myself atm  :-[

download it here:  http://www.omegav.ntnu.no/~kamben/adder5x.vhd

or copy paste this:

-- This block uses 94 LUTs with only 29 Carry chain LUTs. (sliceM/L) (implemented purely combinatorial, no regs )
-- XST synth of 4 or 5 input adder uses 64 LUTs with 64 carry chain LUTs. (sliceM/L)   (implemented purely combinatorial, no regs )

LIBRARY IEEE;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_unsigned."+";

Library UNISIM;               -- needed for the LUT6_2 (and optional LUT5) primitives instantiated below
use UNISIM.vcomponents.all;

ENTITY adder5x IS PORT (
 reset               : IN  std_logic;
 clk                 : IN  std_logic;

  ina                 : IN  std_logic_vector(31 downto 0);   
  inb                 : IN  std_logic_vector(31 downto 0);   
  inc                 : IN  std_logic_vector(31 downto 0);   
  ind                 : IN  std_logic_vector(31 downto 0);   
  ine                 : IN  std_logic_vector(31 downto 0);   
 
  qout                : OUT std_logic_vector(31 downto 0)); 
END  adder5x;


ARCHITECTURE rtl OF adder5x IS
 

SIGNAL  a:  std_logic_vector(31 downto 0);
SIGNAL  b:  std_logic_vector(31 downto 0);
SIGNAL  c:  std_logic_vector(31 downto 0);
SIGNAL  d:  std_logic_vector(31 downto 0); 
SIGNAL  e:  std_logic_vector(31 downto 0);   
SIGNAL  qr: std_logic_vector(31 downto 0);
--

SIGNAL SA,SAr     :std_logic_vector(31 downto 0);
SIGNAL SB,SBr     :std_logic_vector(31 downto 2);
SIGNAL S1,S2,S3   :std_logic_vector(31 downto 0);

--SIGNAL fasit : std_logic_vector(31 downto 0);

BEGIN
 
 
--  input_reg: PROCESS (reset, clk)
--BEGIN
--   IF (clk'event AND clk='1') THEN     
      a<=ina;
      b<=inb;                         
      c<=inc; 
      d<=ind; 
      e<=ine;   
--   END IF; 
--END PROCESS;   


--  pipe_reg: PROCESS (reset, clk)
--BEGIN
--   IF (clk'event AND clk='1') THEN          
      SAr<=SA;                           -- if your whole "chain" only has 1 pipeline register
      SBr<=SB;                           -- this might be a good place to put it
--   END IF; 
--END PROCESS;
   
   
--  output_reg: PROCESS (reset, clk)
--BEGIN 
--   IF (clk'event AND clk='1') THEN     
      qr<=SAr+(SBr & "00");             -- Regular carry chain adder for the last stage       
--   END IF; 
--END PROCESS; 


qout<=qr;


--fasit<=a+b+c+d+e;

------------
--calc


-- first LUT column of adder
-- 5 single bit inputs -> 3 bit sum output
LUT_stage1:FOR i IN 0 TO 31 GENERATE   
 
---------
S1(i)<=a(i) XOR b(i) XOR c(i) XOR d(i) XOR e(i);

-----
-- forced LUT alternative. slightly faster, uses more overall LUTs
-- could save 1 sliceM/L for every 2 adder blocks. Might make routing easier.

--LUT5_inst1a : LUT5
--generic map (
--INIT => x"96696996")
--port map (
--O =>  S1(i),
--I0 => a(i),
--I1 => b(i),
--I2 => c(i),
--I3 => d(i),
--I4 => e(i));   
-----
---------

LUT_inst1bc : LUT6_2
generic map (
INIT => x"E8808000177E7EE8")       
port map (
O6 =>  S3(i),
O5 =>  S2(i),
I0 => a(i),       
I1 => b(i),     
I2 => c(i),   
I3 => d(i),   
I4 => e(i),
I5 => '1');     

END GENERATE;


-- 2x3bit LUT sums -> 2+2bit output sum
-- max sum =  5+(2*5)=15, range 0-15 -> exact 4 bit
LUT_stage2A:FOR i IN 0 TO 15 GENERATE   

SA((i*2))<=S1((i*2));
SA((i*2)+1)<=S2((i*2)) XOR S1((i*2)+1); 

END GENERATE;   


--SB(0)<='0';
--SB(1)<='0'; 

LUT_stage2B:FOR i IN 0 TO 14 GENERATE   

LUT_inst2cd : LUT6_2
generic map (
INIT => x"0077640000641364")   
port map (
O6 =>  SB((i*2)+3),
O5 =>  SB((i*2)+2),
I0 => S2((i*2)),      -- B1
I1 => S3((i*2)),      -- C1
I2 => S1((i*2)+1),    -- A2
I3 => S2((i*2)+1),    -- B2
I4 => S3((i*2)+1),    -- C2
I5 => '1');   --   

END GENERATE;   


END rtl;
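
Since the author says the adder is untested, here is a rough bit-level Python model of it that can be run as a sanity check. It is only a sketch: the two INIT constants are copied from the VHDL above, the LUT6_2 behaviour with I5 tied to '1' (O5 reads the lower 32 INIT bits, O6 the upper 32) is taken from how the primitive is documented, and the function names are made up for illustration.

```python
import random

MASK = 0xFFFFFFFF
INIT_STAGE1 = 0xE8808000177E7EE8   # LUT_inst1bc in the post
INIT_STAGE2 = 0x0077640000641364   # LUT_inst2cd in the post

def lut6_2(init, i0, i1, i2, i3, i4):
    """LUT6_2 with I5 = '1'. Returns (O6, O5)."""
    idx = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3) | (i4 << 4)
    o5 = (init >> idx) & 1          # lower half of INIT
    o6 = (init >> (32 + idx)) & 1   # upper half of INIT
    return o6, o5

def adder5x(a, b, c, d, e):
    """Model of the posted adder: returns (a+b+c+d+e) mod 2**32 if the INITs are right."""
    s1, s2, s3 = [0] * 32, [0] * 32, [0] * 32
    # Stage 1: per bit column, compress 5 input bits into a 3-bit count.
    for i in range(32):
        bits = [(v >> i) & 1 for v in (a, b, c, d, e)]
        s1[i] = bits[0] ^ bits[1] ^ bits[2] ^ bits[3] ^ bits[4]
        s3[i], s2[i] = lut6_2(INIT_STAGE1, *bits)
    # Stage 2A: low two bits of each column pair go to SA.
    sa = 0
    for i in range(16):
        sa |= s1[2 * i] << (2 * i)
        sa |= (s2[2 * i] ^ s1[2 * i + 1]) << (2 * i + 1)
    # Stage 2B: high two bits of each column pair go to SB (bits 31 downto 2).
    sb = 0
    for i in range(15):
        o6, o5 = lut6_2(INIT_STAGE2, s2[2 * i], s3[2 * i],
                        s1[2 * i + 1], s2[2 * i + 1], s3[2 * i + 1])
        sb |= o5 << (2 * i + 2)
        sb |= o6 << (2 * i + 3)
    # Final stage: one ordinary carry-chain add, as in qr <= SAr + (SBr & "00").
    return (sa + sb) & MASK
```

Comparing `adder5x` against a plain `(a + b + c + d + e) & 0xFFFFFFFF` over random inputs is a quick way to check the LUT contents before spending time on a build.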


sr. member
Activity: 406
Merit: 257
Yep, and that's with 200ps clock jitter, your assumed 0-jitter clock would knock it down to 5.091ns cycle time => a bit over 196MHz ;D
On my real world rev1.1 boards with -2 speed grade, 1230-1250mV Vccint and 25°C ambient, this bitstream averages 193.9 MHz at very low error rate (0 errors over 2**35 hashes).
Pushing up error rate to 0.1% => 198.3MHz average.
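
For readers converting between the cycle times and clock rates quoted in this exchange, the relation is just f = 1/T. A tiny sketch (the helper names are made up for illustration):

```python
def period_ns_to_mhz(period_ns):
    """Clock frequency in MHz for a cycle time given in nanoseconds."""
    return 1000.0 / period_ns

def mhz_to_period_ns(mhz):
    """Cycle time in nanoseconds for a clock frequency given in MHz."""
    return 1000.0 / mhz

print(round(period_ns_to_mhz(5.091), 1))   # the 0-jitter cycle time
print(round(period_ns_to_mhz(5.172), 1))   # the cycle time with 200ps jitter
```

The two results land a bit over 196 MHz and a bit over 193 MHz, in line with the figures quoted in the posts.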
rph
full member
Activity: 176
Merit: 100
Interesting. 5.172ns == 193MHz!

Thanks for the data.

-rph
sr. member
Activity: 406
Merit: 257
And here the exact same HDL/settings but using ISE 13.2 and tightening timing a bit:

Total REAL time to MAP completion:  30 mins 17 secs
Total CPU time to MAP completion:   28 mins 23 secs

Slice Logic Utilization:
  Number of Slice Registers:                92,968 out of 184,304   50%
    Number used as Flip Flops:              92,823
    Number used as Latches:                      0
    Number used as Latch-thrus:                  0
    Number used as AND/OR logics:              145
  Number of Slice LUTs:                     60,406 out of  92,152   65%
    Number used as logic:                   34,257 out of  92,152   37%
      Number using O6 output only:          21,087
      Number using O5 output only:             409
      Number using O5 and O6:               12,761
      Number used as ROM:                        0
    Number used as Memory:                   2,721 out of  21,680   12%
      Number used as Dual Port RAM:              0
      Number used as Single Port RAM:            0
      Number used as Shift Register:         2,721
        Number using O6 output only:           450
        Number using O5 output only:             0
        Number using O5 and O6:              2,271
    Number used exclusively as route-thrus: 23,428
      Number with same-slice register load: 23,414
      Number with same-slice carry load:        14
      Number with other load:                    0

Slice Logic Distribution:
  Number of occupied Slices:                15,446 out of  23,038   67%
  Number of LUT Flip Flop pairs used:       60,460
    Number with an unused Flip Flop:           868 out of  60,460    1%
    Number with an unused LUT:                  54 out of  60,460    1%
    Number of fully used LUT-FF pairs:      59,538 out of  60,460   98%
    Number of slice register sites lost
      to control set restrictions:               0 out of 184,304    0%

Total REAL time to Router completion: 25 mins 23 secs
Total CPU time to Router completion: 24 mins 26 secs


----------------------------------------------------------------------------------------------------------
  Constraint                                |    Check    | Worst Case |  Best Case | Timing |   Timing   
                                            |             |    Slack   | Achievable | Errors |    Score   
----------------------------------------------------------------------------------------------------------
  TS_coreclk = PERIOD TIMEGRP "tncoreclk" 1 | SETUP       |     0.233ns|     5.172ns|       0|           0
  85 MHz HIGH 50% INPUT_JITTER 0.2 ns       | HOLD        |     0.316ns|            |       0|           0
sr. member
Activity: 406
Merit: 257
As a little encouragement, here's a decent run for my old design (ISE synth+map+p&r, letting synth infer shift regs, no placement constraints, ...) for -3 speed grade

Device Utilization Summary:

Slice Logic Utilization:
  Number of Slice Registers:                92,964 out of 184,304   50%
    Number used as Flip Flops:              92,819
    Number used as Latches:                      0
    Number used as Latch-thrus:                  0
    Number used as AND/OR logics:              145
  Number of Slice LUTs:                     62,141 out of  92,152   67%
    Number used as logic:                   34,288 out of  92,152   37%
      Number using O6 output only:          21,087
      Number using O5 output only:             424
      Number using O5 and O6:               12,777
      Number used as ROM:                        0
    Number used as Memory:                   2,721 out of  21,680   12%
      Number used as Dual Port RAM:              0
      Number used as Single Port RAM:            0
      Number used as Shift Register:         2,721
        Number using O6 output only:           450
        Number using O5 output only:             0
        Number using O5 and O6:              2,271
    Number used exclusively as route-thrus: 25,132
      Number with same-slice register load: 25,117
      Number with same-slice carry load:        15
      Number with other load:                    0

Slice Logic Distribution:
  Number of occupied Slices:                16,519 out of  23,038   71%
  Number of LUT Flip Flop pairs used:       62,163
    Number with an unused Flip Flop:         2,573 out of  62,163    4%
    Number with an unused LUT:                  22 out of  62,163    1%
    Number of fully used LUT-FF pairs:      59,568 out of  62,163   95%
    Number of slice register sites lost
      to control set restrictions:               0 out of 184,304    0%

...
Total REAL time to PAR completion: 19 mins 45 secs
Total CPU time to PAR completion: 20 mins 34 secs


Timing:
 ================================================================================
 Timing constraint: TS_coreclk = PERIOD TIMEGRP "tncoreclk" 182 MHz HIGH 50% INPUT_JITTER 0.2 ns;
  3102658 paths analyzed, 386321 endpoints analyzed, 0 failing endpoints
  0 timing errors detected. (0 setup errors, 0 hold errors, 0 component switching limit errors)
  Minimum period is   5.262ns.
 --------------------------------------------------------------------------------
 
 Paths for end point XLXI_A/rb20/regt1_31 (SLICE_X104Y33.CIN), 252 paths
 --------------------------------------------------------------------------------
 Slack (setup path):     0.232ns (requirement - (data path - clock path skew + uncertainty))
   Source:               XLXI_A/rb19/outE_17 (FF)
   Destination:          XLXI_A/rb20/regt1_31 (FF)
   Requirement:          5.494ns
   Data Path Delay:      4.928ns (Levels of Logic = 8)
   Clock Path Skew:      -0.111ns (0.620 - 0.731)
   Source Clock:         coreclk rising at 0.000ns
   Destination Clock:    coreclk rising at 5.494ns
   Clock Uncertainty:    0.223ns
 
   Clock Uncertainty:          0.223ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
     Total System Jitter (TSJ):  0.070ns
     Total Input Jitter (TIJ):   0.200ns
     Discrete Jitter (DJ):       0.233ns
     Phase Error (PE):           0.000ns
 
   Maximum Data Path at Slow Process Corner: XLXI_A/rb19/outE_17 to XLXI_A/rb20/regt1_31
     Location             Delay type         Delay(ns)  Physical Resource
                                                        Logical Resource(s)
     -------------------------------------------------  -------------------
     SLICE_X126Y28.BQ     Tcko                  0.408   XLXI_A/rb19/outE<19>
                                                        XLXI_A/rb19/outE_17
     SLICE_X115Y27.A4     net (fanout=8)        1.658   XLXI_A/rb19/outE<17>
     SLICE_X115Y27.A      Tilo                  0.259   XLXI_A/rb20/s1<6>
                                                        XLXI_A/rb20/s1<6>1
     SLICE_X104Y27.CX     net (fanout=1)        1.798   XLXI_A/rb20/s1<6>
     SLICE_X104Y27.COUT   Tcxcy                 0.093   XLXI_A/rb20/regt1<7>
                                                        XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<7>
     SLICE_X104Y28.CIN    net (fanout=1)        0.003   XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<7>
     SLICE_X104Y28.COUT   Tbyp                  0.076   XLXI_A/rb20/regt1<11>
                                                        XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<11>
     SLICE_X104Y29.CIN    net (fanout=1)        0.003   XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<11>
     SLICE_X104Y29.COUT   Tbyp                  0.076   XLXI_A/rb20/regt1<15>
                                                        XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<15>
     SLICE_X104Y30.CIN    net (fanout=1)        0.003   XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<15>
     SLICE_X104Y30.COUT   Tbyp                  0.076   XLXI_A/rb20/regt1<19>
                                                        XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<19>
     SLICE_X104Y31.CIN    net (fanout=1)        0.003   XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<19>
     SLICE_X104Y31.COUT   Tbyp                  0.076   XLXI_A/rb20/regt1<23>
                                                        XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<23>
     SLICE_X104Y32.CIN    net (fanout=1)        0.003   XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<23>
     SLICE_X104Y32.COUT   Tbyp                  0.076   XLXI_A/rb20/regt1<27>
                                                        XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<27>
     SLICE_X104Y33.CIN    net (fanout=1)        0.003   XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_cy<27>
     SLICE_X104Y33.CLK    Tcinck                0.314   XLXI_A/rb20/regt1<31>
                                                        XLXI_A/rb20/Madd_s1[31]_ch[31]_add_18_OUT_xor<31>
                                                        XLXI_A/rb20/regt1_31
     -------------------------------------------------  ---------------------------
     Total                                      4.928ns (1.454ns logic, 3.474ns route)
                                                        (29.5% logic, 70.5% route)
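
The slack and clock uncertainty lines in the report above follow the two formulas the tool prints. A small sketch that plugs in the reported numbers (function names are made up for illustration; the uncertainty formula lands within about 1 ps of the printed 0.223ns, presumably because the tool rounds or the printed inputs are themselves rounded):

```python
from math import sqrt

def clock_uncertainty(tsj, tij, dj, pe=0.0):
    """((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE, all values in ns."""
    return (sqrt(tsj ** 2 + tij ** 2) + dj) / 2 + pe

def setup_slack(requirement, data_path, clock_skew, uncertainty):
    """requirement - (data path - clock path skew + uncertainty), all in ns."""
    return requirement - (data_path - clock_skew + uncertainty)

unc = clock_uncertainty(tsj=0.070, tij=0.200, dj=0.233)
slack = setup_slack(requirement=5.494, data_path=4.928,
                    clock_skew=-0.111, uncertainty=0.223)
min_period = 5.494 - slack

print(round(unc, 4), round(slack, 3), round(min_period, 3))
```

Requirement minus slack gives the 5.262ns minimum period printed at the top of the timing section.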
rph
full member
Activity: 176
Merit: 100
Build summary as ztex requested. Hopefully substantiates that 150MH/s+
is possible in this device, non-OC'd, with some room to spare!  ;D

Slice Logic Utilization:
  Number of Slice Registers:                96,777 out of 184,304   52%
    Number used as Flip Flops:              96,777
    Number used as Latches:                      0
    Number used as Latch-thrus:                  0
    Number used as AND/OR logics:                0
  Number of Slice LUTs:                     58,692 out of  92,152   63%
    Number used as logic:                   39,716 out of  92,152   43%
      Number using O6 output only:          29,405
      Number using O5 output only:             424
      Number using O5 and O6:                9,887
      Number used as ROM:                        0
    Number used as Memory:                   3,056 out of  21,680   14%
      Number used as Dual Port RAM:              0
      Number used as Single Port RAM:            0
      Number used as Shift Register:         3,056
        Number using O6 output only:             0
        Number using O5 output only:             0
        Number using O5 and O6:              3,056
    Number used exclusively as route-thrus: 15,920
      Number with same-slice register load: 15,858
      Number with same-slice carry load:        62
      Number with other load:                    0
---
...
Phase 12  : 0 unrouted; (Setup:0, Hold:0, Component Switching Limit:0)     REAL time: 1 hrs 30 mins 48 secs
Total REAL time to Router completion: 1 hrs 30 mins 48 secs
Total CPU time to Router completion: 1 hrs 31 mins 27 secs
---
Critical path:
Slack:                  0.121ns (requirement - (data path - clock path skew + uncertainty))
  Source:               engine/sha2/pipe/stagegen[5].stageX/o_state_32 (FF)
  Destination:          engine/sha2/pipe/stagegen[6].stageX/t2_30 (FF)
  Requirement:          6.666ns
  Data Path Delay:      6.382ns (Levels of Logic = 8)
  Clock Path Skew:      -0.021ns (0.239 - 0.260)
  Source Clock:         clk rising at 0.000ns
  Destination Clock:    clk rising at 6.666ns
  Clock Uncertainty:    0.142ns

  Clock Uncertainty:          0.142ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.070ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.213ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: engine/sha2/pipe/stagegen[5].stageX/o_state_32 to engine/sha2/pipe/stagegen[6].stageX/t2_30
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    SLICE_X49Y175.AMUX   Tshcko                0.461   engine/sha2/pipe/stagegen[5].stageX/state<3>
                                                       engine/sha2/pipe/stagegen[5].stageX/o_state_32
    SLICE_X32Y164.A5     net (fanout=2)        4.629   engine/sha2/pipe/stagegen[5].stageX/o_state<32>
    SLICE_X32Y164.COUT   Topcya                0.395   engine/sha2/pipe/stagegen[6].stageX/t2<3>
                                                       engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_lut<0>
                                                       engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<3>
    SLICE_X32Y165.CIN    net (fanout=1)        0.003   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<3>
    SLICE_X32Y165.COUT   Tbyp                  0.076   engine/sha2/pipe/stagegen[6].stageX/t2<7>
                                                       engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<7>
    SLICE_X32Y166.CIN    net (fanout=1)        0.003   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<7>
    SLICE_X32Y166.COUT   Tbyp                  0.076   engine/sha2/pipe/stagegen[6].stageX/t2<11>
                                                       engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<11>
    SLICE_X32Y167.CIN    net (fanout=1)        0.003   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<11>
    SLICE_X32Y167.COUT   Tbyp                  0.076   engine/sha2/pipe/stagegen[6].stageX/t2<15>
                                                       engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<15>
    SLICE_X32Y168.CIN    net (fanout=1)        0.082   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<15>
    SLICE_X32Y168.COUT   Tbyp                  0.076   engine/sha2/pipe/stagegen[6].stageX/t2<19>
                                                       engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<19>
    SLICE_X32Y169.CIN    net (fanout=1)        0.003   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<19>
    SLICE_X32Y169.COUT   Tbyp                  0.076   engine/sha2/pipe/stagegen[6].stageX/t2<23>
                                                       engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<23>
    SLICE_X32Y170.CIN    net (fanout=1)        0.003   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<23>
    SLICE_X32Y170.COUT   Tbyp                  0.076   engine/sha2/pipe/stagegen[6].stageX/t2<27>
                                                       engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<27>
    SLICE_X32Y171.CIN    net (fanout=1)        0.003   engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_cy<27>
    SLICE_X32Y171.CLK    Tcinck                0.341   engine/sha2/pipe/stagegen[6].stageX/t2<31>
                                                       engine/sha2/pipe/stagegen[6].stageX/Madd_e0[31]_maj[31]_add_2_OUT_xor<31>
                                                       engine/sha2/pipe/stagegen[6].stageX/t2_30
    -------------------------------------------------  ---------------------------
    Total                                      6.382ns (1.653ns logic, 4.729ns route)
                                                       (25.9% logic, 74.1% route)
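
To make the routing problem in this critical path easier to see, here is a short sketch that re-adds the per-hop delays from the table above. The delay lists are copied from the report; the MH/s figure assumes one hash per clock from a fully pipelined core, which is an assumption for illustration and not something the report states.

```python
# Delays in ns, copied from the critical path table above.
LOGIC_NS = [0.461, 0.395] + [0.076] * 6 + [0.341]
ROUTE_NS = [4.629, 0.003, 0.003, 0.003, 0.082, 0.003, 0.003, 0.003]

def path_breakdown(logic_ns, route_ns):
    logic, route = sum(logic_ns), sum(route_ns)
    total = logic + route
    return {
        "total_ns": round(total, 3),
        "logic_pct": round(100 * logic / total, 1),
        "route_pct": round(100 * route / total, 1),
        "longest_net_pct": round(100 * max(route_ns) / total, 1),
    }

result = path_breakdown(LOGIC_NS, ROUTE_NS)
print(result)

# Requirement minus slack is the achieved minimum period.
min_period_ns = 6.666 - 0.121
achieved_mhz = 1000.0 / min_period_ns   # equals MH/s only under the one-hash-per-clock assumption
print(round(achieved_mhz, 1))
```

The first net alone (4.629ns, fanout 2) accounts for well over two thirds of the whole path, which fits the earlier complaint about long-distance routes rather than logic depth being the limit.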
rph
full member
Activity: 176
Merit: 100
> ArtForz and rph, the theoretical speed limit according to xst for a 1 stage / sha256 round pipeline is 156.490 MHz (sounds familiar).
> This is defined by the levels of logic. In practice about 130 -135 MHz can be achieved.

Agreed. The xst Fmax is unrealistically high b/c it does not really consider routing delays.

I was using a two cycle per stage (3-input adder) design like ArtForz described, to reach 156MHz in -3. It took about 3 hours to map.
I had to tune the build options; with some settings map would just freeze for 1+ day during global placement. Very annoying.
Will post more once I build the HW (hopefully this weekend), and ensure the design works and produces valid hashes
and can actually be powered/cooled.

-rph
donator
Activity: 367
Merit: 250
ZTEX FPGA Boards
rph, please share your code with us or at least the report files (.syr, .map, .mrp, .par) so that I can believe your results.

ArtForz and rph, the theoretical speed limit according to xst for a 1 stage / sha256 round pipeline is 156.490 MHz (sounds familiar). This is defined by the levels of logic. In practice about 130 -135 MHz can be achieved.

The theoretical frequency limit of a 2 stage / sha256 pipeline reported by xst is up to 284.649 MHz, depending on the amount of additional registers. But as written before, I was not able to route such a design. I even tried to use DSP slices, which dropped the theoretical max. frequency to about 80MHz.
rph
full member
Activity: 176
Merit: 100
ArtForz, I've tried exactly that, and reached 156MHz in -3 (probably ~180MH/s OC'd).
I guess there is room to optimize a bit more.

The DSP48s are useful if you design around them. I've reached 500MHz+ with a rolled SHA256
core on V6 w/o much effort. The DSPs have dedicated routing and dedicated registers, so
there's less room for the SW tools to screw things up.

-rph
 
sr. member
Activity: 406
Merit: 257
While 2 pipelines could theoretically fit on a S6-LX150, routing is impossible (S6s have a lot less "long" routing resources than virtexes).
What *is* possible is one pipeline with 2 register stages per sha256 round, use SRL16s for W where possible, don't go overboard with em and don't let synthesis infer more shift regs.
And don't expect to get > 170MHz or so post-p&r on -2 without giving the placer some help.

*edit: SRL16, not LSR16
*edit2: the xilinx docs aren't very clear on this, but a LUT6 in a SLICEM can be used as *2* equal-length SRL16s => a 32-bit wide shift reg up to 17 deep (slice FFs give the last stage) is only 4 SLICEMs.
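
The slice count in edit2 can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes 4 LUTs per SLICEM (as on Spartan-6), takes the 2 SRL16s per LUT from the post, and the function name is made up for illustration.

```python
from math import ceil

def slicem_count(width_bits, depth, luts_per_slice=4, srls_per_lut=2, srl_depth=16):
    """SLICEMs needed for a shift register of the given width and depth."""
    # One extra stage comes from the slice flip-flop behind each SRL16.
    if depth > srl_depth + 1:
        raise ValueError("deeper than one SRL16 plus one flip-flop per bit")
    bits_per_slice = luts_per_slice * srls_per_lut
    return ceil(width_bits / bits_per_slice)

print(slicem_count(32, 17))
```

With 8 shift-register bits per SLICEM, a 32-bit wide register up to 17 deep comes out at 4 SLICEMs, as stated.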
donator
Activity: 367
Merit: 250
ZTEX FPGA Boards
Quote
Well, my theory is that two pipelines could (barely) fit into a 6s150-sized part, if it had more carry chains
plus DSP48E1s. With the SLICEX and only 180 DSP48A1s, there's just no way.
Did you ever try to validate your theory? Your theory does not consider some facts:
  • Only about 1/3 of the required slices of the current design need to have a carry chain. As I wrote in my last post: there is no shortage of carry chains.
  • It's not sufficient just to place the LUTs; they also have to be connected in some way, i.e. they have to be routed. The routing resources are the limiting factor.
  • DSP slices cannot be used for adder trees, are slower than LUTs, and are more difficult to route.

Quote
And.. I think DSP48, esp DSP48E1 in V6/A7/K7, can be very useful for SHA256.
I hope to validate that claim soon on real HW. ;D
I tried it on S6. The number of required LUTs was reduced by less than 10%, and AFAIR the clock dropped to 70-80 MHz.
rph
full member
Activity: 176
Merit: 100
Well, my theory is that two pipelines could (barely) fit into a 6s150-sized part, if it had more carry chains
plus DSP48E1s. With the SLICEX and only 180 DSP48A1s, there's just no way.

RE Pricing: There will always be a premium for the Virtex. But much smaller than one might suspect
based on non-negotiated list prices on the web. V6 is a 40nm part; S6 is 45nm. And since V6/K7 doesn't
have the SLICEX limitations, and offers larger devices (meaning fewer boards to assemble)
I think it's worth it for a big operation.

And.. I think DSP48, esp DSP48E1 in V6/A7/K7, can be very useful for SHA256.
I hope to validate that claim soon on real HW. ;D

-rph
donator
Activity: 367
Merit: 250
ZTEX FPGA Boards
Quote
In high vol (>1k) the mid-range virtex6 parts become close to 6s150 in price per LUT.
High vol pricing is basically set by the die size.

I never asked my distributor for 1k prices of Spartans or Virtexes, but that would surprise me for several reasons:
  • Virtex FPGAs are faster. Xilinx would not sell them for the same price/LUT.
  • Virtex FPGAs contain a huge number of large DSP slices, which are mainly used for multiplication and are almost useless for bitcoin mining. These DSP slices are the backbone of Virtex FPGAs (used for HPC), and Xilinx would not give them away for free.
  • Spartan FPGAs are simpler (SLICEX).

Quote
I would not use spartan6 in a large scale mining operation, as the SLICEX
makes about half the LUTs (thus, half the die) useless. Virtex6 or Kintex7, with a carry
chain in every LUT
There is no shortage of carry chains. About 60% of the used slices are SLICEXs, so I wouldn't agree that they can't be used ;)


rph
full member
Activity: 176
Merit: 100
In high vol (>1k) the mid-range virtex6 parts become close to 6s150 in price per LUT.
High vol pricing is basically set by the die size.

I would not use spartan6 in a large scale mining operation, as the SLICEX
makes about half the LUTs (thus, half the die) useless. Virtex6 or Kintex7, with a carry
chain in every LUT, should have higher MH/$ for the folks willing to invest real money
into FPGA mining. And then once you're in virtex, if your wallet allows, you can consider
Easypath..

-rph
member
Activity: 91
Merit: 10
Thanks for all the valuable info. You folks are awesome.