[CLOSED] Bitmine CoinCraft A1 28nm chip distribution / DIY support - page 9.

zefir

donator

Activity: 919

Merit: 1000

Notification: A1 CoinCraft Desk cgminer driver now upstreamed

Since some of you asked for it: as of today, Con pulled the current A1 driver (as variant for CoinCraft Desk) into cgminer. I will follow soon with the variant for the Rig, and after we get the products into delivery state, I'll also work on a bfgminer integration.

goodney

member

Activity: 102

Merit: 10

Quote from: totalslacker on February 19, 2014, 08:30:32 PM

I haven't looked at power at the board level at all yet - hopefully that's the issue. I think this board is running a little low on the voltage front so hopefully I can clean things up with that.

totalslacker:

Have you been able to measure the current or power consumed by one chip at 800MHz?

Thanks!

-a[g

Cheshyr

full member

Activity: 168

Merit: 100

Quote from: zefir on February 20, 2014, 06:52:22 AM

Quote from: silver71 on February 20, 2014, 06:10:08 AM

Why 16, when you already have a ref. design for 8, and 8 is easier to cool...

Even when you think about immersion cooling, if anything goes wrong, your whole chain (16) goes offline...

Wouldn't it then be smarter to stick with 8?

16 was just a number I pulled from thin air

As for the technical challenge: with the experience collected so far, I'd say once you have an 8-chip chain working, it is not a huge step to move to 16 chips. At least from the A1 side; I don't understand the DCDC part well enough to state that it would be easy.

The immersion cooling idea follows DaT's approach here, where due to the high cost of the fluid it is essential to stuff as much hashing power into as little volume as possible. I think one could get a 4x4 A1 matrix onto a 10x10cm PCB and stack the boards 1cm apart, resulting in a 6kW burner in a 1 liter cube.

That would be more of a fun project than a serious one, and I proposed adding it as a challenge for Bitmine's planned design contest - which for obvious reasons was put at the back of the priority queue and might never leave the announcement phase :(

Takers? I'd supply the chips and the fluid.

I'd be glad to poke at that after we finish these first two A1 designs. I've got a gallon of Novec 7000 in the corner, and was planning on playing with it using A1s anyway.

DCDC will be the real problem, and it really depends how fast you plan to clock the chips.

silver71

member

Activity: 101

Merit: 10

no avatar for now

Quote from: zefir on February 20, 2014, 06:52:22 AM

Quote from: silver71 on February 20, 2014, 06:10:08 AM

Why 16, when you already have a ref. design for 8, and 8 is easier to cool...

Even when you think about immersion cooling, if anything goes wrong, your whole chain (16) goes offline...

Wouldn't it then be smarter to stick with 8?

16 was just a number I pulled from thin air

As for the technical challenge: with the experience collected so far, I'd say once you have an 8-chip chain working, it is not a huge step to move to 16 chips. At least from the A1 side; I don't understand the DCDC part well enough to state that it would be easy.

The immersion cooling idea follows DaT's approach here, where due to the high cost of the fluid it is essential to stuff as much hashing power into as little volume as possible. I think one could get a 4x4 A1 matrix onto a 10x10cm PCB and stack the boards 1cm apart, resulting in a 6kW burner in a 1 liter cube.

That would be more of a fun project than a serious one, and I proposed adding it as a challenge for Bitmine's planned design contest - which for obvious reasons was put at the back of the priority queue and might never leave the announcement phase :(

Takers? I'd supply the chips and the fluid.

Let's first solve the present rig delay (non-delivery), and then go further with improved cooling...

About the fluid: did you mean oil or a 3M-like liquid?

It would be less expensive to just hook the rig up to the central heating (as a heat source), and use household radiators as coolers.

zefir

donator

Activity: 919

Merit: 1000

Quote from: silver71 on February 20, 2014, 06:10:08 AM

Why 16, when you already have a ref. design for 8, and 8 is easier to cool...

Even when you think about immersion cooling, if anything goes wrong, your whole chain (16) goes offline...

Wouldn't it then be smarter to stick with 8?

16 was just a number I pulled from thin air

As for the technical challenge: with the experience collected so far, I'd say once you have an 8-chip chain working, it is not a huge step to move to 16 chips. At least from the A1 side; I don't understand the DCDC part well enough to state that it would be easy.

The immersion cooling idea follows DaT's approach here, where due to the high cost of the fluid it is essential to stuff as much hashing power into as little volume as possible. I think one could get a 4x4 A1 matrix onto a 10x10cm PCB and stack the boards 1cm apart, resulting in a 6kW burner in a 1 liter cube.

That would be more of a fun project than a serious one, and I proposed adding it as a challenge for Bitmine's planned design contest - which for obvious reasons was put at the back of the priority queue and might never leave the announcement phase :(

Takers? I'd supply the chips and the fluid.

silver71

member

Activity: 101

Merit: 10

no avatar for now

Quote from: zefir on February 20, 2014, 05:39:07 AM

Quote from: totalslacker on February 19, 2014, 08:30:32 PM

That looks to work very well. Unfortunately my hashing test is hitting some errors at anything much over 800MHz.

I don't see chips dropping out or bad results being produced, just that nonces are missed. My code checks for all six nonces in sequence for each job issued (all four chips are run simultaneously and the job queue is kept as full as possible).

I don't think the nonce queue is overflowing as I'm issuing the read result command very frequently (and then checking chip status for finished jobs). I will get a bunch of no results and then a chip reports a nonce that has skipped a previous result.

I haven't looked at power at the board level at all yet - hopefully that's the issue. I think this board is running a little low on the voltage front so hopefully I can clean things up with that.

The good news is that I did get it to run at up to 1.25GHz (40GH/s). About two-thirds of the nonces were dropped but it successfully ran the test (101 jobs) to completion.

My SPI frequency is a little low (I think it's around 250kHz) but I wouldn't expect that to cause issues unless it were so low that the nonce queues couldn't be serviced frequently enough?

For 1.25GHz to run stably, you will need extreme cooling. Once the initial stress is over, I plan to develop a stackable 16-chip board for a submerged setup. A 10 PCB stack will fit into one liter - 160 chips run at 40GH/s each, or 6.4TH/s in-a-box. But with heatsink cooling, 1.25GHz is definitely a challenge.

As for the work / result pipelining: luckily, with input and output queues the feeding of jobs and the collecting of results are decoupled, leaving you the freedom to find a good trade-off between their priorities. We dimensioned the output queue based on statistics collected from mining sessions, which showed that 99.9% of all jobs have 4 or fewer results.

If you run the A1 at nominal speed (800MHz), it will crunch the nonce range in ~160ms. That is, if you collect the results and feed new jobs every 160ms, you should get 99.9% of all potential results. To catch all results of a job with more than 4 winning nonces, you need to poll for results in between, i.e. check every 80ms, and you significantly reduce the chance of losing the 5th result. Obviously, you can't cover all potential cases and prevent losing results entirely: assume you have a job with 5 winning nonces that happen to be very close to each other - the chip will spit them out before you have time to collect them. But the chances of that are negligible.

Bottom line: if you feed the chip at least every nonce period (e.g. 160ms at 800MHz) and check for jobs every half nonce period, you should be on the safe side.

As for SPI clock, I made some theoretical considerations here: https://bitcointalksearch.org/topic/m.4554508

Why 16, when you already have a ref. design for 8, and 8 is easier to cool...

Even when you think about immersion cooling, if anything goes wrong, your whole chain (16) goes offline...

Wouldn't it then be smarter to stick with 8?

-ck

legendary

Activity: 4088

Merit: 1631

Ruu \o/

Code merged into master cgminer, thanks.

zefir

donator

Activity: 919

Merit: 1000

Quote from: totalslacker on February 19, 2014, 08:30:32 PM

That looks to work very well. Unfortunately my hashing test is hitting some errors at anything much over 800MHz.

I don't see chips dropping out or bad results being produced, just that nonces are missed. My code checks for all six nonces in sequence for each job issued (all four chips are run simultaneously and the job queue is kept as full as possible).

I don't think the nonce queue is overflowing as I'm issuing the read result command very frequently (and then checking chip status for finished jobs). I will get a bunch of no results and then a chip reports a nonce that has skipped a previous result.

I haven't looked at power at the board level at all yet - hopefully that's the issue. I think this board is running a little low on the voltage front so hopefully I can clean things up with that.

The good news is that I did get it to run at up to 1.25GHz (40GH/s). About two-thirds of the nonces were dropped but it successfully ran the test (101 jobs) to completion.

My SPI frequency is a little low (I think it's around 250kHz) but I wouldn't expect that to cause issues unless it were so low that the nonce queues couldn't be serviced frequently enough?

For 1.25GHz to run stably, you will need extreme cooling. Once the initial stress is over, I plan to develop a stackable 16-chip board for a submerged setup. A 10 PCB stack will fit into one liter - 160 chips run at 40GH/s each, or 6.4TH/s in-a-box. But with heatsink cooling, 1.25GHz is definitely a challenge.

As for the work / result pipelining: luckily, with input and output queues the feeding of jobs and the collecting of results are decoupled, leaving you the freedom to find a good trade-off between their priorities. We dimensioned the output queue based on statistics collected from mining sessions, which showed that 99.9% of all jobs have 4 or fewer results.

If you run the A1 at nominal speed (800MHz), it will crunch the nonce range in ~160ms. That is, if you collect the results and feed new jobs every 160ms, you should get 99.9% of all potential results. To catch all results of a job with more than 4 winning nonces, you need to poll for results in between, i.e. check every 80ms, and you significantly reduce the chance of losing the 5th result. Obviously, you can't cover all potential cases and prevent losing results entirely: assume you have a job with 5 winning nonces that happen to be very close to each other - the chip will spit them out before you have time to collect them. But the chances of that are negligible.

Bottom line: if you feed the chip at least every nonce period (e.g. 160ms at 800MHz) and check for jobs every half nonce period, you should be on the safe side.
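As a rough sketch of that rule of thumb, the two intervals can be derived from the chip clock. This assumes the nonce period scales inversely with the system clock, starting from the ~160ms at 800MHz figure; the function name is made up for illustration and is not part of the cgminer driver.

```python
# Rule-of-thumb timing: feed jobs once per nonce period,
# poll for results every half nonce period.
NOMINAL_CLOCK_MHZ = 800    # nominal A1 clock
NOMINAL_PERIOD_MS = 160.0  # approximate nonce-range time at nominal clock


def polling_intervals_ms(sys_clock_mhz):
    """Return (feed_interval_ms, poll_interval_ms) for a given chip clock.

    Assumption: the time needed to crunch the nonce range is inversely
    proportional to the system clock.
    """
    if sys_clock_mhz <= 0:
        raise ValueError("clock must be positive")
    period = NOMINAL_PERIOD_MS * NOMINAL_CLOCK_MHZ / sys_clock_mhz
    return period, period / 2


if __name__ == "__main__":
    for mhz in (800, 1000, 1250):
        feed, poll = polling_intervals_ms(mhz)
        print(f"{mhz} MHz: feed every {feed:.1f} ms, poll every {poll:.1f} ms")
```

Under that scaling assumption, 1.25GHz works out to feeding roughly every 102ms and polling roughly every 51ms.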

As for SPI clock, I made some theoretical considerations here: https://bitcointalksearch.org/topic/m.4554508

totalslacker

newbie

Activity: 26

Merit: 0

Quote from: zefir on February 19, 2014, 05:21:06 AM

Quote from: totalslacker on February 18, 2014, 09:15:16 PM

One other question here: are there known PLL settings for faster speeds?

I tried manually running through your set_pll_config code for this but the result didn't come out so well (I might very well have done this incorrectly). I wasn't quite sure what the extents of the various fields are as the data sheet looks to be different to your code (and to the various PLL settings posted).

Assuming you have a 12MHz ref clock, try this:

set pll_postdiv and pll_prediv to 2
set fbdiv to (target_sys_freq / 3), i.e. you can set your sys_clock in increments of 3MHz, with fbdiv being 9 bit you can go to 1.5+GHz

Code:

reg[0] = 0x84 | (fbdiv >> 8)
reg[1] = fbdiv & 0xff

Thanks zefir!

That looks to work very well. Unfortunately my hashing test is hitting some errors at anything much over 800MHz.

I don't see chips dropping out or bad results being produced, just that nonces are missed. My code checks for all six nonces in sequence for each job issued (all four chips are run simultaneously and the job queue is kept as full as possible).

I don't think the nonce queue is overflowing as I'm issuing the read result command very frequently (and then checking chip status for finished jobs). I will get a bunch of no results and then a chip reports a nonce that has skipped a previous result.

I haven't looked at power at the board level at all yet - hopefully that's the issue. I think this board is running a little low on the voltage front so hopefully I can clean things up with that.

The good news is that I did get it to run at up to 1.25GHz (40GH/s). About two-thirds of the nonces were dropped but it successfully ran the test (101 jobs) to completion.

My SPI frequency is a little low (I think it's around 250kHz) but I wouldn't expect that to cause issues unless it were so low that the nonce queues couldn't be serviced frequently enough?

BuTaJIu4eK

member

Activity: 115

Merit: 10

Message to people who came from http://bitmine.ch/ !!! :-\

https://bitcointalk.org/index.php?topic=291141.new#new
P.S. I'm sorry!

zulunation

sr. member

Activity: 335

Merit: 250

I have connected an oscilloscope to the DO pin. On power-up it shows a constant 0.82V. Vddcore is 0.84V.
When cgminer is running it is still a constant 0.82V.

Has anyone seen such a level?

silver71

member

Activity: 101

Merit: 10

no avatar for now

Quote from: Cheshyr on February 18, 2014, 11:06:08 PM

Or, you could use a level shifter.

Re: capacitance... Monte Carlo simulations, spice models, and rough prototypes don't take that long, and designing for worst case removes much of the complexity. High accuracy tolerances can be mitigated by buying a larger cap in many cases, or paralleling cheap large and cheap small caps.

The reference design is a foundation and guideline; I doubt it was meant to do all the engineering for you. Most of the tradeoffs are going to have to be engineering decisions by the person designing the board.

I will use the following analogy:

Here are some issues Cointerra has with SPI corruption, with their power supplies and with their board:

NOTE FROM Power-One POWER SUPPLY MANUAL:

Note: Care must be taken when using ceramic capacitors with a total capacitance of 1 µF to 50 µF on output V1, due to their high quality factor the output ripple voltage may be increased in certain frequency ranges due to resonance effects.

Even though Cointerra uses 2x1100 and Bitmine uses 1x3000 power supplies, this excerpt is from the Power-One manual for the 3 kW power supply.

A similar (but not identical) case could probably occur with the Bitmine CoinCraft power supply, with effects on SPI:

http://www.power-one.com/sites/power-one.com/files/documents/power/datasheet/bcd_00297_aa_pfe3000-12-069ra_datasheet.pdf

zefir

donator

Activity: 919

Merit: 1000

Quote from: zulunation on February 19, 2014, 03:34:52 AM

Quote from: BenTuras on February 19, 2014, 02:06:39 AM

Quote from: zefir on February 18, 2014, 06:34:36 PM

...
If you still don't get any feedback from the chip, please double check that the RSTn is kept at 18V for at least 1s before your first command to the chip.
...

I think you mean 1.8V !

Sure, 1.8V.
But the datasheet states that the signal is active low. So it must be kept at 0V for 1s. Or do you mean that after it is held at 0V for 1s, it must be kept at 1.8V for another 1s before the first command?

Reset is active low: you need to drive it low for at least 1s, then drive it to 1V8 and keep it stable there. Give the chip 1s before you send the first command; see the pseudo code snippet you referred to here: https://bitcointalksearch.org/topic/m.5205169
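A minimal sketch of that sequence, for illustration only: `set_rstn` is a hypothetical callback standing in for whatever GPIO access your controller offers (0 drives the pin low, 1 drives it to the 1.8V level), and the function is not taken from the driver.

```python
import time


def reset_a1(set_rstn, sleep=time.sleep, hold_s=1.0, settle_s=1.0):
    """Pulse the active-low RSTn line and wait before the first command.

    set_rstn is a hypothetical callback: set_rstn(0) drives the pin low,
    set_rstn(1) drives it to the 1.8V logic-high level.
    """
    set_rstn(0)      # assert reset (active low)
    sleep(hold_s)    # hold it low for at least 1s
    set_rstn(1)      # release reset: drive to 1V8 and keep it there
    sleep(settle_s)  # give the chip 1s before sending the first command
```

Passing `sleep` as a parameter keeps the sequence easy to dry-run without hardware, e.g. `reset_a1(lambda level: print("RSTn ->", level))`.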

Quote from: totalslacker on February 18, 2014, 09:15:16 PM

One other question here: are there known PLL settings for faster speeds?

I tried manually running through your set_pll_config code for this but the result didn't come out so well (I might very well have done this incorrectly). I wasn't quite sure what the extents of the various fields are as the data sheet looks to be different to your code (and to the various PLL settings posted).

Assuming you have a 12MHz ref clock, try this:

set pll_postdiv and pll_prediv to 2
set fbdiv to (target_sys_freq / 3), i.e. you can set your sys_clock in increments of 3MHz, with fbdiv being 9 bit you can go to 1.5+GHz

Code:

reg[0] = 0x84 | (fbdiv >> 8)
reg[1] = fbdiv & 0xff
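To make the recipe concrete, here is a small sketch that derives fbdiv and the two register bytes for a target clock. It assumes the 12MHz ref clock with pll_prediv = pll_postdiv = 2 as described; the function and constant names are made up for illustration, and only the two bytes shown in the snippet are covered.

```python
STEP_MHZ = 3       # sys_clock increment with 12MHz ref, prediv = postdiv = 2
FBDIV_MAX = 0x1FF  # fbdiv is a 9-bit field


def pll_regs(target_sys_mhz):
    """Return (fbdiv, reg0, reg1, actual_mhz) for a target system clock.

    The target is rounded down to the nearest 3MHz step.
    """
    fbdiv = target_sys_mhz // STEP_MHZ
    if not 1 <= fbdiv <= FBDIV_MAX:
        raise ValueError("target clock out of range for a 9-bit fbdiv")
    reg0 = 0x84 | (fbdiv >> 8)  # top bit of fbdiv goes into reg[0]
    reg1 = fbdiv & 0xFF         # low 8 bits go into reg[1]
    return fbdiv, reg0, reg1, fbdiv * STEP_MHZ


if __name__ == "__main__":
    for mhz in (800, 1250):
        fbdiv, reg0, reg1, actual = pll_regs(mhz)
        print(f"{mhz} MHz -> fbdiv={fbdiv}, reg[0]=0x{reg0:02x}, "
              f"reg[1]=0x{reg1:02x}, actual={actual} MHz")
```

For a 1.25GHz target this gives fbdiv = 416, i.e. 1248MHz, which matches the effective multiplier of 104 on a 12MHz reference.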

zulunation

sr. member

Activity: 335

Merit: 250

Quote from: BenTuras on February 19, 2014, 02:06:39 AM

Quote from: zefir on February 18, 2014, 06:34:36 PM

...
If you still don't get any feedback from the chip, please double check that the RSTn is kept at 18V for at least 1s before your first command to the chip.
...

I think you mean 1.8V !

Sure, 1.8V.
But the datasheet states that the signal is active low. So it must be kept at 0V for 1s. Or do you mean that after it is held at 0V for 1s, it must be kept at 1.8V for another 1s before the first command?

BenTuras

hero member

Activity: 826

Merit: 1001

Quote from: zefir on February 18, 2014, 06:34:36 PM

...
If you still don't get any feedback from the chip, please double check that the RSTn is kept at 18V for at least 1s before your first command to the chip.
...

I think you mean 1.8V !

justroll

newbie

Activity: 16

Merit: 0

TDK has an excellent reference doc concerning decoupling.

I think X7R 20% is a good standard for the things you need to do.

The dev/proto board is using a range of overscaled items. I hear that Bitmine had some issues with their protoboard up front regarding decoupling and thermals.

ref:

http://www.digikey.co.il/Web%20Export/Supplier%20Content/TDK_445/PDF/TDK_ESR_Controlled_MLCCs.pdf?redirected=1
http://www.youtube.com/watch?feature=player_embedded&v=Rrgdi843Dec
http://www.pcbcarolina.com/images/01_pcb_power_decoupling_myths_debunked.pdf
http://www.cdnusers.org/community/allegro/Resources/resources_pcbsi/si/tp_zhen_capacitors.pdf

Cheshyr

full member

Activity: 168

Merit: 100

Or, you could use a level shifter.

Re: capacitance... Monte Carlo simulations, spice models, and rough prototypes don't take that long, and designing for worst case removes much of the complexity. High accuracy tolerances can be mitigated by buying a larger cap in many cases, or paralleling cheap large and cheap small caps.

The reference design is a foundation and guideline; I doubt it was meant to do all the engineering for you. Most of the tradeoffs are going to have to be engineering decisions by the person designing the board.

silver71

member

Activity: 101

Merit: 10

no avatar for now

Quote from: zefir on January 10, 2014, 03:21:50 AM

Correction: Level Shifters mandatory

Quote from: zefir on January 07, 2014, 01:48:30 PM

Well, as pure SW guy I can provide only limited HW related feedback, so please double check.

Quote from: MrTeal on January 06, 2014, 08:18:26 PM

I have both a level shifter and the option to use an inline resistor to drop the 3.3V signal down to 1.8V on my test board similarly to how some Bitfury designs have implemented it. Have you investigated doing that, or just feeding 3.3V straight in?

I understood that the eval board used in China (the one you saw in the pictures) for testing has a level shifter for input and output signals, while Bitmine's boards use resistors to lower the input signals and a level shifter for the output signal (MISO) - both seem to work.

This seems to be true only for lower clock frequencies. As we approach the nominal clock range, the different delays within resistor network and integrated level shifter add up to a skew large enough to corrupt SPI communication. This is still being tested, but if you want to be on the safe side, use integrated level shifters for all signals to your uC (if its IO is not 1.8V).

Resistor tolerances for all resistors chosen for the DIY 2xA1 PCB are 5% or worse. Such a tolerance makes it difficult for the driver software to manage voltage levels on the A1, since you don't know which voltage will actually show up, and we are talking about the second and third digit after the decimal point.

Parts with tighter tolerances are available. They are, of course, much more expensive, but they might not trigger SPI corruption:

Resistor parts with available tolerances:

+/-0.01%
+/-0.02%
+/-0.05%
+/-0.1%
+/-0.2%
+/-0.25%
+/-0.5%
+/-1%
+/-2%
etc...

Any of them would work better than the components currently chosen for the DIY PCB.
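To put rough numbers on the tolerance argument, here is a sketch of the worst-case output of a resistor divider. This is an assumption-laden illustration: the actual resistor network on the DIY board is not given in this thread, so it assumes a plain two-resistor divider from 3.3V to 1.8V with hypothetical values of 1.0k and 1.2k.

```python
def divider_out(v_in, r_top, r_bottom):
    """Unloaded output voltage of a two-resistor divider."""
    return v_in * r_bottom / (r_top + r_bottom)


def divider_range(v_in, r_top, r_bottom, tol):
    """Return (lowest, nominal, highest) output for a fractional tolerance.

    Worst cases: top resistor high with bottom low gives the lowest output,
    top resistor low with bottom high gives the highest output.
    """
    nominal = divider_out(v_in, r_top, r_bottom)
    lowest = divider_out(v_in, r_top * (1 + tol), r_bottom * (1 - tol))
    highest = divider_out(v_in, r_top * (1 - tol), r_bottom * (1 + tol))
    return lowest, nominal, highest


if __name__ == "__main__":
    # Hypothetical values: 3.3V in, 1.0k over 1.2k gives 1.8V nominal.
    for tol in (0.05, 0.01, 0.001):
        lo, nom, hi = divider_range(3.3, 1000.0, 1200.0, tol)
        print(f"tol {tol * 100:g}%: {lo:.4f} V .. {nom:.4f} V .. {hi:.4f} V")
```

With these assumed values, 5% parts spread the level over roughly 1.72V to 1.88V, while 1% parts keep it within about 1.78V to 1.82V.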

The capacitors chosen are even worse, since they have 10% and 20% tolerance.
Why?

Capacitors (e.g. ceramic) exist with the following tolerances:
+/- 0.05 pF
+/- 0.075 pF
+/- 0.075 pF
+/- 0.2 pF
+/- 0.25 pF
+/- 0.5%
+/- 1%
+/- 2%
+/- 2.5%
etc...

Would overclocking the board and chips be much less of a problem if we used high-grade components (with tighter tolerances and a suitably higher power rating for overclocking)? I think yes, and until that happens, squeezing any more out of the A1 would be magic.

There is also a temperature coefficient (TC) for every professional component.

You have the following TCs to choose from:

A,B, BG, BP, BR, BX, C, C0G, NP0, C0H, C0J, C0K, CH, D, E, F, H3M, M3K, N1500, N2000, N2200, N2500, N2800, N4700, N750, P2G, P2H, P3K, P90, R, R2H, R3A, R3L, S2H, S3B, S3L, S3N, SL, SL/GP, T2H, U2J, U2M, X5E, X5F, X5P, X5R, X5S, X5U, X5V, X6S, X6T, X6R, X7R, X7S, X7T, X7U, X8G, X8L, X8R, X8Y, 5F, Y5P(B), Y5R, Y5S, Y5U (E), VF, Z4V, Z5P , Z5T, Z5U, Z5V, ZM

Can anyone address this?

totalslacker

newbie

Activity: 26

Merit: 0

zefir,

One other question here: are there known PLL settings for faster speeds?

I ran a quick test (submitting 101 jobs to the four chips as fast as they could take them and then checking that I got all six of my nonces out of each one) and things looked good (no errors and came out right at 25GH/s).

I need to run this for longer of course to be sure, but I need to get the cooling going for that.

I'd also like to try pushing the chip clock rate for a short run, perhaps seeing if I can get to the 40GH/s turbo mode. I assume that's computed based on a 1.25GHz system clock, so for my 12MHz reference clock I would need an effective multiplier of 104.

I tried manually running through your set_pll_config code for this but the result didn't come out so well (I might very well have done this incorrectly). I wasn't quite sure what the extents of the various fields are as the data sheet looks to be different to your code (and to the various PLL settings posted).

Thanks for your help!

end18

newbie

Activity: 40

Merit: 0

Quote from: zefir on February 18, 2014, 06:34:36 PM

This. The sample chips need 850mV min; not sure if the chips from the production batch are better suited to down-volting - but obviously everybody is looking to up-volt them anyway

Oh... I tried using 0v7 ~ 0v765 for safe testing...

I see, that's why my chips return such a low signal level...

Thank you Zefir, I'll try it at 0v85.
