hrm.... yeah, been doing more testing... and it seems like I have high LUT usage because some of the "RAM" is being inferred as LUTs?
do you get any of these messages when you compile?
INFO:Xst:3218 - HDL ADVISOR - The RAM will be implemented on LUTs either because you have described an asynchronous read or because of currently unsupported block RAM features. If you have described an asynchronous read, making it synchronous would allow you to take advantage of available block RAM resources, for optimized device usage and improved timings. Please refer to your documentation for coding guidelines.
-----------------------------------------------------------------------
| ram_type | Distributed | |
-----------------------------------------------------------------------
| Port A |
| aspect ratio | 64-word x 32-bit | |
| weA | connected to signal | high |
| addrA | connected to signal | |
| diA | connected to signal | |
| doA | connected to signal | |
-----------------------------------------------------------------------
really odd.... it's not happening to all of the sha_transform modules though... it only seems to be one: the 2nd one, with NUM_ROUNDS set to 61, it appears
also, I see things like this when it's synthesizing:
Found 6x6-bit multiplier for signal created at line 120.
Found 6x32-bit multiplier for signal created at line 127.
line 120 is:
assign K = Ks_mem[(NUM_ROUNDS/LOOP)*cnt+i];
line 127 is:
assign K_next = Ks_mem[(NUM_ROUNDS/LOOP)*((cnt+i) & (LOOP-1)) +i+1];
hrm... there has to be a better way to use those generate blocks so these values aren't treated as signals/wires evaluated at runtime, but instead become constant integers or lookup tables/muxes...
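One idea for getting the multiplier out of the address path, as a rough sketch only: NUM_ROUNDS/LOOP is a constant, so instead of multiplying it by cnt every time, keep a running base address that grows by that constant whenever cnt steps. The module name, port names and widths below are made up for illustration, and it assumes cnt simply counts 0, 1, ..., LOOP-1 and wraps, which may not match the real design. It also only covers the K address, not the masked expression used for K_next.

```verilog
// Hypothetical helper: tracks (NUM_ROUNDS/LOOP)*cnt without a multiplier.
// Assumes cnt steps 0,1,...,LOOP-1 and then wraps back to 0.
module ks_base_addr #(
    parameter NUM_ROUNDS = 64,
    parameter LOOP       = 4
) (
    input  wire       clk,
    input  wire       rst,
    output reg  [5:0] cnt,
    output reg  [5:0] base   // equals (NUM_ROUNDS/LOOP)*cnt after reset
);
    // Constant, evaluated at elaboration time rather than in hardware.
    localparam STRIDE = NUM_ROUNDS / LOOP;

    always @(posedge clk) begin
        if (rst || cnt == LOOP - 1) begin
            cnt  <= 0;
            base <= 0;
        end else begin
            cnt  <= cnt + 1;
            base <= base + STRIDE;  // one small adder instead of a multiplier
        end
    end
endmodule
```

Inside the generate loop the address would then be base + i, where i is the genvar and so a constant per instance. If NUM_ROUNDS/LOOP happens to be a power of two, a plain shift of cnt would do the same job.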
edit/update: if I use this for the K and K_next assignment when LOOP == 1, I don't get the LUT messages anymore:
`ifdef USE_RAM_FOR_KS
if ( LOOP == 1) begin
assign K = Ks_mem[ i ];
assign K_next = Ks_mem[ i + 1 ];
end else begin
...
I think the problem is that K and K_next are not assigned inside a clocked always block, so the memory read becomes asynchronous combinatorial logic, and XST can't map an asynchronous read onto block RAM (which is what the HDL ADVISOR message above says too). Or maybe it's that a multiplier output is being used as the address? Something in there XST wasn't liking for me.
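For reference, a synchronous read looks roughly like the sketch below. This is not the code from the project: the module name, port names and the ks.hex file are placeholders I'm assuming for illustration. The point is only that the read happens inside a posedge block, which is the pattern the HDL ADVISOR message asks for.

```verilog
// Hypothetical wrapper around the K constants; all names are illustrative.
module ks_rom_sync (
    input  wire        clk,
    input  wire [5:0]  addr,
    output reg  [31:0] k
);
    reg [31:0] Ks_mem [0:63];

    // Assumed file name; it would hold the 64 SHA-256 round constants in hex.
    initial $readmemh("ks.hex", Ks_mem);

    // Registered read: k is valid one clock after addr is presented.
    always @(posedge clk)
        k <= Ks_mem[addr];
endmodule
```

The catch is the extra clock of latency, so the address has to be presented one cycle earlier than with the assign version. XST may also still decide a memory this small belongs in LUTs even with a synchronous read.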
also, it seems the 1st round synthesizes much differently?
for the first sha block I get this:
Summary:
inferred 10 Adder/Subtractor(s).
inferred 551 D-type flip-flop(s).
inferred 17 Multiplexer(s).
Unit synthesized.
for the 2nd block I get this:
Summary:
inferred 62 RAM(s).
inferred 2 Multiplier(s).
inferred 63 Adder/Subtractor(s).
inferred 295 D-type flip-flop(s).
inferred 17 Multiplexer(s).
Unit synthesized.
why are these so different!?
first off, are they sharing the RAM for the K's? It seems only the K's for the 2nd block are generated, but Xilinx might be optimizing across the hierarchy here. But what about the # of adders/subtractors!? only 10 in the first block? how can that be? or is it that it's moving the adders from the digester up to the higher module?
I also see this:
Synthesizing Unit .
Related source file is "e:/bitcoin/lx150_makomk_test/hdl/sha256_transform.v".
LENGTH = 8
WARNING:Xst:3035 - Index value(s) does not match array range for signal , simulation mismatch.
which relates to this shift register code:
reg [31:0] m[0:(LENGTH-2)];
always @ (posedge clk)
begin
addr <= (addr + 1) % (LENGTH - 1);
now when I look at that, I'm not sure if that's correct. So let's say LENGTH = 8. The first line says create an array of 32-bit registers with (8-2+1) elements, so 7 elements, and the addr modulus wraps around at 7, e.g. once ( addr + 1 ) == 7, addr becomes 0, not 7. So it looks like we are missing the last element of the shift register.
I think this is just an indexing problem - LENGTH = 8 means 8 elements in the shift register, so you want reg [31:0] m[0:7] or reg [31:0] m[0:(LENGTH-1)]. Then below on the addr assignment, you would want addr <= ( addr + 1 ) % ( LENGTH ), because with a LENGTH of 8, xxx % 8 will always return a value between 0 and 7 inclusive.
Not sure how this is even working with one of the shift registers effectively 1 element short....
edit: seems if I "fix" this, it breaks it heh..... I need to look into this
ok another edit/update: it seems this code is correct, because you also have a 32-bit register r in there that's separate from the m storage array, so m only needs LENGTH-1 entries. And that also explains the different synthesis for this module. It's using a RAM, a 32-bit register r, a 3-bit register addr and a 9-bit adder for the next address, as opposed to just LENGTH*32 registers/FFs for the other types of shift registers... not sure which one is better here
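To convince myself, here is how I read that shift register, written out as a standalone sketch. Only the m declaration and the addr line come from the snippet above; the module name, port names, the addr width and the r/m assignments are my assumptions about the rest of it.

```verilog
// Sketch of the RAM-based delay line; module and port names are assumed.
// Needs LENGTH >= 2, otherwise the array range and the modulo break down.
module shifter_sketch #(
    parameter LENGTH = 8
) (
    input  wire        clk,
    input  wire [31:0] val_in,
    output wire [31:0] val_out
);
    reg [31:0] m [0:(LENGTH-2)];   // LENGTH-1 words of storage
    reg [31:0] r;                  // extra stage that brings the total to LENGTH
    reg [7:0]  addr = 0;           // assumed width, wider than strictly needed

    always @(posedge clk) begin
        addr    <= (addr + 1) % (LENGTH - 1);  // cycles 0 .. LENGTH-2
        r       <= m[addr];                    // old word, written LENGTH-1 clocks ago
        m[addr] <= val_in;                     // then overwrite it with the new word
    end

    assign val_out = r;
endmodule
```

A word written into m[addr] gets read back into r the next time addr comes around, LENGTH-1 clocks later, and r itself counts as one more stage, so the total is LENGTH stages, same as a plain LENGTH-deep register chain. If addr is declared wider than the array needs, as assumed here, that could also be what the Xst:3035 index range warning is complaining about, but that's a guess.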
on another note, I placed 2 cores ( 4 sha256 transforms ) into the design, it said I was using 140% LUTs, but it's still trying to route it right now? It's been running for over 12 hours though....