OK, so got boards back this week and initial checks all look good!
Have reliable SPI comms with all four chips and power looks to be nice and stable (for at least low frequency operation, need to add cooling before pushing them).
Zefir, your bring-up instructions were very helpful - thank you! I was having trouble getting the chip to accept a job and then I finally realized I wasn't reading your instructions properly (had skipped a section). Fixed that and the chips are hashing.
But I'm not getting the same result as your test vector at
https://bitcointalksearch.org/topic/m.4454746
The registers all look to be good. I can submit the first and second job. The appropriate job active bits are set and the correct job id is set. But I only get one nonce: 47 b8 a6 62
I get this same single nonce on both jobs and the two chips I've tried. I've looked over the test vector and compared it to my code as well as capturing the SPI bus on a SPI analyzer and it all looks correct…
Might anyone have some suggestions on what I'm missing?
Thanks!
So I realized that the midstate and wdata might be in the wrong byte order, so I swapped them using the create_job function from Zefir's driver as a reference. No joy though, as now I'm not getting any nonce solutions.
It was interesting though that the first chip initially was taking a very long time to run the two jobs - as in a good 30 seconds (at 90 MHz PLL). Chips 2 and 4 ran it in a few seconds as expected (chip 3 in my chain doesn't seem to be functional - SPI communications look to be good until I send it a job, at which point it stops responding).
I powered the system down for a while then tried again, and chip 1 was back to running at normal speed. So that's a tad concerning (the PLL had successfully locked in the first test where it was slow, and the SPI clock was at the correct frequency for 90 MHz operation). Maybe some hash engines weren't functional (even though it reported all 32 as good)?
Any pointers on my incorrect nonce results would be greatly appreciated!
Thanks!
Hm, this to me looks like a power issue - at least the effects you observe are similar to what we had here until we got the DCDC stabilized.
First of all, try not to go below 200 MHz sysclock, since I am not sure if the PLL settings are correct for low values or how low it can really get. Operating the chip at 200 MHz even without cooling is no problem.
Then ensure the supply voltage is above 0.85V and ripple is within valid tolerance. Same goes for the 1V8 reference, reset and SPI signals. If you got access to the register, I guess you did that correctly. You can try to stress-test the inter-chip SPI by continuously reading the register of each chip over a longer period to ensure there are no issues.
Usually, the serious troubles begin when you supply the chips with work and they start to hash. The power draw immediately spikes by orders of magnitude, and if the DCDC is not capable of handling that, the voltage ripple will eventually exceed the tolerance. The chip then usually resets itself, and with that you usually lose access to it, since the chip becomes unaware that it is part of a chain. To regain access to it, you need to HW reset the whole chain and re-enumerate the chips.
A strong indication that the chip was reset after it started hashing is the inability to read out its register. For example, you write 0x0a02 to get the register of chip 2 and you read back 0x0a02 instead of 0x1a02, meaning there is no chip 2 in the chain any more.
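To make the check above concrete, here is a rough sketch of how a host-side test could look. It is not taken from the driver: `chip_answered`, `poll_chain` and `fake_transfer` are made-up names, and treating the reply as "sent header with bit 0x1000 set" for every chip ID is an assumption generalized from the single 0x0a02 / 0x1a02 example. The `transfer` callable stands in for whatever function does the real SPI exchange and returns the 16-bit header read back.

```python
from typing import Callable, Dict, Iterable

READ_REG = 0x0A00  # read-register header; the low byte carries the chip ID (0x0a02 = chip 2)
ACK_BIT = 0x1000   # assumption: the addressed chip sets this bit in its reply (0x0a02 -> 0x1a02)


def chip_answered(sent: int, received: int) -> bool:
    """True if the header read back is the sent header with the ack bit set."""
    return received == (sent | ACK_BIT)


def poll_chain(transfer: Callable[[int], int],
               chip_ids: Iterable[int],
               rounds: int) -> Dict[int, int]:
    """Read every chip's register `rounds` times and count missed replies per chip."""
    chip_ids = list(chip_ids)
    missed = {cid: 0 for cid in chip_ids}
    for _ in range(rounds):
        for cid in chip_ids:
            sent = READ_REG | cid
            if not chip_answered(sent, transfer(sent)):
                missed[cid] += 1
    return missed


def fake_transfer(header: int) -> int:
    """Stand-in for the real SPI exchange: chip 3 has dropped out of the chain."""
    return header if (header & 0xFF) == 3 else header | ACK_BIT


print(poll_chain(fake_transfer, [1, 2, 3, 4], rounds=100))
# {1: 0, 2: 0, 3: 100, 4: 0}
```

Running the same loop for a longer period, with and without work loaded, also doubles as the inter-chip SPI stress test: a chip whose miss count starts climbing only after jobs are submitted most likely reset itself.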
We detected problems in the DCDC by scoping the levels long term and triggering for levels outside the tolerance range.
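If your scope or logger can export the samples, the same out-of-tolerance trigger can be approximated offline. This is only a sketch: `find_excursions` is a made-up helper, the 0.85 V lower limit is the minimum supply mentioned above, and the 0.95 V upper limit and the sample values are placeholders, not figures from the datasheet.

```python
from typing import List, Sequence, Tuple


def find_excursions(samples: Sequence[float],
                    v_min: float,
                    v_max: float) -> List[Tuple[int, int]]:
    """Return (first_index, last_index) for each run of samples outside [v_min, v_max]."""
    runs = []
    start = None
    for i, v in enumerate(samples):
        outside = not (v_min <= v <= v_max)
        if outside and start is None:
            start = i                    # an excursion begins here
        elif not outside and start is not None:
            runs.append((start, i - 1))  # the excursion ended on the previous sample
            start = None
    if start is not None:                # trace ended while still out of tolerance
        runs.append((start, len(samples) - 1))
    return runs


# Placeholder trace in volts; v_max=0.95 is a hypothetical upper limit.
trace = [0.90, 0.89, 0.84, 0.83, 0.88, 0.91, 0.97, 0.90]
print(find_excursions(trace, v_min=0.85, v_max=0.95))
# [(2, 3), (6, 6)]
```

Lining up the excursion indices with the moment the first job is written shows quickly whether the dips coincide with the chips starting to hash.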
As for your other issue with the endianness of the job command: the provided driver uses 8-bit transfers and the create_job() function prepares the data for byte-wise operation. If you are not using 16-bit transfers and did not modify the source code, please post a trace of the related SPI transfer and I will double check with my logs.
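Why the transfer width matters can be shown with a small sketch. It assumes a little-endian host whose SPI controller shifts each 16-bit word out most significant byte first; check both for your setup. `wire_order_16bit` and `swap_pairs` are made-up names, and the four bytes are just an illustration (the 0x0a 0x02 header from above plus two arbitrary bytes), not a real job.

```python
import struct


def wire_order_16bit(buf: bytes) -> bytes:
    """Byte order seen on the wire when `buf` is sent as 16-bit words.

    Assumes the buffer is read as little-endian 16-bit words and each
    word is shifted out most significant byte first.
    """
    if len(buf) % 2:
        raise ValueError("buffer length must be even for 16-bit transfers")
    n = len(buf) // 2
    words = struct.unpack("<%dH" % n, buf)  # how a little-endian host reads the words
    return struct.pack(">%dH" % n, *words)  # MSB of each word goes out first


def swap_pairs(buf: bytes) -> bytes:
    """Swap each pair of bytes so a 16-bit transfer reproduces the byte-wise order."""
    out = bytearray(buf)
    out[0::2], out[1::2] = buf[1::2], buf[0::2]
    return bytes(out)


prepared = bytes([0x0A, 0x02, 0x11, 0x22])  # data laid out for 8-bit transfers
print(wire_order_16bit(prepared).hex())              # 020a2211
print(wire_order_16bit(swap_pairs(prepared)).hex())  # 0a021122
```

Under these assumptions, a buffer prepared for byte-wise operation arrives with every byte pair swapped when pushed through 16-bit transfers, so it has to be pre-swapped (or the transfer width changed) before the chip sees the intended order.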