Many thanks, that's very useful background! Unfortunately maths was never my strongest subject, but I'll take the time to understand it. I'm more the "hack it together and see if it works" type than the academic type.
I was rather hoping that the fpgaminer code would work "out of the box", but it seems things are never that simple.
I have made some progress though. I've been comparing the different versions and compiled the Xilinx branch LX150_makomk_Test ... it needed a little bit of tweaking (GOLDEN_NONCE_OFFSET was out by one), but it's working at LOOP_LOG2=3 and generating valid hashes.
It's bumped up the throughput by 50%, so now I'm getting 7.6 MH/s at 80 MHz and 14.9 MH/s at 120 MHz (OOPS, belay that remark, it's kicking out bad hashes at 120 MHz, not so good).
Multi-core sounds good, perhaps mixing the sizes (say a LOOP_LOG2=3 plus a LOOP_LOG2=4) to fill up the device. However, I rather expect throughput to ultimately be thermally bound (the power dissipation will scale with MH/s rather than MHz, at least to a first approximation). I plan to see what performance I can get at -20C (freezer temperatures), as this is far more practical with a 10 W FPGA than a 200 W GPU! It would be nice to set the clock speed dynamically too, so the devices can self-calibrate and ramp themselves up to a maximum clock speed. As I said in my earlier post, this is going to be fun. And if I can get the kit to pay for itself, then that's just a bonus.
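The self-calibration idea might look roughly like the control loop below. It is only a sketch in Python to show the logic: `set_clock_mhz` and `hashes_ok` are made-up hooks (nothing like them is known to exist in the fpgaminer code), and `FakeDevice` is a pretend board that starts producing bad hashes at a chosen frequency.

```python
from typing import Callable, Optional


def calibrate_clock(set_clock_mhz: Callable[[int], None],
                    hashes_ok: Callable[[], bool],
                    start_mhz: int = 80,
                    max_mhz: int = 200,
                    step_mhz: int = 5) -> int:
    """Ramp the clock up until bad hashes appear, then back off one step.

    Both callbacks are hypothetical hooks supplied by the caller:
    set_clock_mhz reprograms the clock, hashes_ok runs a known test
    block and reports whether the device returned a valid result.
    """
    best: Optional[int] = None
    mhz = start_mhz
    while mhz <= max_mhz:
        set_clock_mhz(mhz)
        if not hashes_ok():
            break              # first failing speed: stop ramping
        best = mhz             # remember the last speed that verified
        mhz += step_mhz
    if best is None:
        raise RuntimeError("device fails even at the starting clock")
    set_clock_mhz(best)        # settle on the last known-good speed
    return best


class FakeDevice:
    """Stand-in for real hardware: goes bad at or above fail_at_mhz."""

    def __init__(self, fail_at_mhz: int) -> None:
        self.fail_at_mhz = fail_at_mhz
        self.mhz = 0

    def set_clock_mhz(self, mhz: int) -> None:
        self.mhz = mhz

    def hashes_ok(self) -> bool:
        return self.mhz < self.fail_at_mhz
```

With `dev = FakeDevice(120)`, calling `calibrate_clock(dev.set_clock_mhz, dev.hashes_ok)` steps up from 80 in 5 MHz increments and settles on the last speed that still verified. A real version would probably want to re-check periodically, since the safe speed will drift with temperature.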
Again, many thanks, hope to stay in touch!
Maths is NOT my strong point either, but I can add up, multiply by 2 (left shift), 'and' and 'xor'
e0
e1
ch
maj
sigma0
sigma1
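For anyone who wants to poke at these six functions in software, here is a small Python model. It assumes the names map onto the standard SHA-256 functions, with e0/e1 being the "big sigma" functions used in the compressor and sigma0/sigma1 the "small sigma" ones used in the expander; the rotate and shift amounts are the ones from the SHA-256 standard, not copied from any particular HDL source.

```python
MASK32 = 0xFFFFFFFF  # everything in SHA-256 is a 32-bit word


def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits (pure wiring in hardware)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def e0(x: int) -> int:
    """Big sigma 0 (assumed mapping): applied to working variable a."""
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)


def e1(x: int) -> int:
    """Big sigma 1 (assumed mapping): applied to working variable e."""
    return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25)


def ch(x: int, y: int, z: int) -> int:
    """Choose: each bit of x selects the matching bit of y (1) or z (0)."""
    return ((x & y) ^ (~x & z)) & MASK32


def maj(x: int, y: int, z: int) -> int:
    """Majority: each output bit is whatever at least two inputs agree on."""
    return (x & y) ^ (x & z) ^ (y & z)


def sigma0(x: int) -> int:
    """Small sigma 0: used in the message expander."""
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def sigma1(x: int) -> int:
    """Small sigma 1: used in the message expander."""
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)
```

None of these contain an addition: they are only rotates, shifts, and bitwise and/xor/not, so in an FPGA they cost a level or two of LUTs. The expensive part is the additions that combine their outputs.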
Basically the speed 'weakness' in this algorithm is the long chain of additions. The design can be broken down into TWO main sections:
the Expander & the Compressor. Addition is associative, so (x+y)+(p+z) gives the same result whichever way you group it,
and you can calculate BOTH
(x+y)
(p+z)
at the SAME time, since neither result depends on the other.
consider:
w_out(511 downto 480) <= s1 + w_in(319 downto 288) + s0 + w_in(31 downto 0);
Whilst it executes within a "single clock cycle"
process(clk)
....
.....
The sheer length of the additions DICTATES the number of logic levels and therefore the MINIMUM clock cycle length, due to the physical implementation of the routing. (The logic has to settle within one CLK cycle, so you cannot clock faster than it settles; all you can do is shorten the logic so that a shorter cycle becomes possible.)
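The regrouping argument can be checked with a small Python model of the four-operand sum from the expander line above. The operand names (`s1`, `w9`, `s0`, `w0`) and the helper functions are purely illustrative; the model only shows that both groupings give the same 32-bit result while the balanced one has fewer adders on its longest path. Whether a given synthesis tool already rebalances the expression for you is something to check in the timing report rather than assume.

```python
MASK32 = 0xFFFFFFFF


def add32(a: int, b: int) -> int:
    """32-bit wrap-around addition, like an unsigned 32-bit adder."""
    return (a + b) & MASK32


def chain_sum(s1: int, w9: int, s0: int, w0: int) -> int:
    """Left-to-right evaluation: three adders, each waiting on the last."""
    return add32(add32(add32(s1, w9), s0), w0)


def tree_sum(s1: int, w9: int, s0: int, w0: int) -> int:
    """Balanced evaluation: two independent adders, then one to combine."""
    left = add32(s1, w9)    # these two sums do not depend on each other,
    right = add32(s0, w0)   # so hardware can compute them in parallel
    return add32(left, right)


def adder_levels(n_operands: int, balanced: bool) -> int:
    """Number of adders on the longest path through the sum."""
    if balanced:
        return (n_operands - 1).bit_length()  # ceil(log2(n)) for n >= 1
    return n_operands - 1
```

For four operands the chain has three adders in series and the tree has two, which is the saving being described. Putting a register between the two tree levels would shorten the path further, at the cost of an extra cycle of latency.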
Also, if you are going to stick shit into the freezer:
1. It ain't going to be a profitable way to mine at 7.6 MH/s, since the cooling cost outweighs the bitcoin value.
2. SEAL the device in a PLASTIC bag with some silica gel, because when you bring the stuff out of the freezer, moisture in the air is going to condense on the design and destroy it. (In a poly bag, condensation is prevented until the design reaches ambient, at which time it can be brought OUT of the bag. The silica gel acts as a buffer to ensure the bag is super low humidity.)
3. It's NEVER going to pay for itself at 7.6 MH/s.