[CLOSED] Bitmine CoinCraft A1 28nm chip distribution / DIY support - page 4.

Dexter770221

legendary

Activity: 1029

Merit: 1000

I've had a similar problem. The cause was the power supply. When I sent write reg first and underclocked the chip to 200MHz, bist fix then returned 31 cores; otherwise (bist fix first, then write reg) it returned 1.

zefir

donator

Activity: 919

Merit: 1000

Quote from: [gadget] on April 17, 2014, 02:23:25 AM

Quick question:

How many functional cores are you guys getting on your chips?
I've tested 4 tonight, and they all seem to only have 1 out of 32.
Are all sample chips like this? Or am I really, really unlucky?

The time to complete the test job is 5.2 sec @ 800MHz, which is consistent with
only 1 of 32 cores going. The chips aren't heating up much at all, either. Even
at 800MHz.

No, that is for sure not ok. As a rule of thumb, I would classify the chips as follows:

grade A: 31 - 32 working cores: run at full speed
grade B: 26 - 30 working cores: run at reduced speed (50-80% of nominal); low risk of disturbing the chain
grade C: 25 or fewer working cores: disable chip (or enable only after proper inspection); high risk of disturbing the chain

Alas, I have never seen a chip with only one working core so far. Are you sure your power supply is stable at ~850mV and you assert reset for ~1s before you issue the BIST command?
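As a minimal sketch, the rule of thumb above could be coded roughly like this. The enum and function names are made up for illustration and are not taken from the cgminer driver.

```c
#include <assert.h>

/* Grades from the rule of thumb; names are illustrative only. */
enum a1_grade { A1_GRADE_A, A1_GRADE_B, A1_GRADE_C };

#define A1_CORES_PER_CHIP 32

/* Map the working-core count reported by BIST to a grade. */
enum a1_grade a1_classify(unsigned working_cores)
{
    assert(working_cores <= A1_CORES_PER_CHIP);
    if (working_cores >= 31)
        return A1_GRADE_A;   /* run at full speed */
    if (working_cores >= 26)
        return A1_GRADE_B;   /* run at 50-80% of nominal */
    return A1_GRADE_C;       /* disable, or enable only after inspection */
}
```

A chip reporting a single working core would land in grade C under this scheme.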

[gadget]

newbie

Activity: 30

Merit: 0

Quick question:

How many functional cores are you guys getting on your chips?
I've tested 4 tonight, and they all seem to only have 1 out of 32.
Are all sample chips like this? Or am I really, really unlucky?

The time to complete the test job is 5.2 sec @ 800MHz, which is consistent with
only 1 of 32 cores going. The chips aren't heating up much at all, either. Even
at 800MHz.
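For reference, a rough sketch of the arithmetic behind that estimate. It assumes a job covers the full 32-bit nonce range and that each working core tests one nonce per clock; the function name is just for illustration.

```c
#include <assert.h>

/* Seconds to scan the full 32-bit nonce range, assuming each working core
 * tests one nonce per clock (hash rate = sys_clk * working cores). */
double a1_job_seconds(double sys_clk_hz, unsigned working_cores)
{
    assert(sys_clk_hz > 0.0 && working_cores > 0);
    return 4294967296.0 / (sys_clk_hz * working_cores);
}
```

At 800MHz this gives about 5.4 s for one core and about 0.17 s for all 32, so a 5.2 s job is in the single-core range.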

[gadget]

newbie

Activity: 30

Merit: 0

more progress:

fan header:
http://i.imgur.com/U4mDcso.jpg

fan header attached:
http://i.imgur.com/N47X0nd.jpg

heatsinked rig in operation:
http://i.imgur.com/JMRo5Kz.jpg

[gadget]

newbie

Activity: 30

Merit: 0

Here are some more pics from mazurov and [gadget]'s build.

A few boards we've put together:

http://i.imgur.com/r1pUq5G.jpg

View of the DCDC area:

http://i.imgur.com/23Aiuzp.jpg

View of the A1 area:

http://i.imgur.com/GkYiUPf.jpg

Heatsink (let's see how far it takes us):

http://i.imgur.com/x7427xS.jpg

The one tool we couldn't have done without was the microscope:

http://i.imgur.com/AtWUq26.jpg

And for those who read this far, here is a small treat: a corrected BOM. I can verify that these parts will get you to a working board (at least during bring-up):

https://docs.google.com/spreadsheet/ccc?key=0AkO84VcUgOWgdFA5S0tGQTcxVVViX0I1VUlPaHhISEE&usp=sharing

mazurov

newbie

Activity: 2

Merit: 0

Here is a TI TXB0106-based level translator I made to talk to A1s. It requires VCC from the MCU board for the high-voltage side to function and translate correctly. It also provides 1.8V for the A1s' data interface. The LDO on the left is a TI LP3871-1.8V. The circuit is trivial.

I need a second one and it takes too much time to build on a protoboard. I'm routing a PCB, will post when ready.

https://www.circuitsathome.com/wp/wp-content/uploads/2014/04/bc_levelshifter.jpg

mazurov

newbie

Activity: 2

Merit: 0

Quote from: mhmmd on March 25, 2014, 06:12:54 PM

Hello everybody

I've found a mess in the BOM of the two-chip reference board!
So far the major issue is that there is a mix-up of two versions: Ver. 1.0.a and Ver. 1.0.b.
The first evident difference is that the latest release is missing C503 and C504 and therefore has a different design.
In the repository there is a mix of files from the two versions, and it is impossible, at least for me, to cross-check the diagram against the BOM; I was doing such a check and finding some inconsistencies when I discovered the problem.
If someone could help me set up a 100% error-free BOM, or at least provide me with a diagram of ver 1.0.b, I would appreciate it very much and will be happy to send him/her a free PCB (I have 100 of them waiting to be populated).

Thank you

The only difference is R12. Take a look at my site, the notes should still be on the front page.

tindela1

newbie

Activity: 5

Merit: 0

I don't know if someone has already mentioned this, but we found out that engineering chips can be undervolted by first starting up at 0.85V and then, once the chips are hashing, adjusting the output voltage. This requires that your DC/DC feedback is designed to allow dynamic voltage adjustment without too much under- or overshoot.
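One way to keep under- and overshoot small is to walk the setpoint in bounded steps instead of jumping straight to the target. A minimal sketch follows; the step size and target are assumptions to be tuned per board, and how the setpoint actually reaches the DC/DC is board-specific and not shown.

```c
#include <assert.h>

/* Next setpoint when walking the core voltage from the current value toward
 * the target in bounded steps. All values are in millivolts. */
unsigned next_vcore_mv(unsigned current_mv, unsigned target_mv, unsigned step_mv)
{
    assert(step_mv > 0);
    if (current_mv > target_mv) {
        unsigned delta = current_mv - target_mv;
        return current_mv - (delta < step_mv ? delta : step_mv);
    }
    if (current_mv < target_mv) {
        unsigned delta = target_mv - current_mv;
        return current_mv + (delta < step_mv ? delta : step_mv);
    }
    return current_mv;   /* already at target */
}
```

Starting from 850 (the 0.85V start-up value), the caller would apply each returned setpoint, wait for the supply to settle and check that the chips are still hashing, then repeat until the target is reached.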

- Noncetech

silver71

member

Activity: 101

Merit: 10


Quote from: totalslacker on March 26, 2014, 10:18:10 AM

Quote from: zefir on March 26, 2014, 03:25:55 AM

Quote

But when we tried to push it past a 1050MHz clock (all the way to 1200MHz), cgminer seemed to show us wrong results. Cgminer showed a slightly lower hash rate than expected (Sys_clk * 32), but it kept going all the way up to 38GH/s per chip. HW errors were very low, lower than at the 32GH/s settings. We did not have any rejects or stales.

Hello,

a diverging hashrate at pool and cgminer simply means you are losing shares through HW errors.

What you need to consider is:

a) a detected HW error also implies that there were errors on true results; the related probability needs to be derived correctly, but I would assume that when you have a HW error rate of 5% it also means you are missing 5% of real results

I have noticed the same thing when pushing the hardware beyond 25GH/s. In my case I'm looping the test vector zefir had posted a while back. Since this has known nonces I can verify that the hardware is returning the correct nonce sequence. Irrespective of errors the time taken for each chip to finish a job always seems to correlate very closely to the configured hash rate.

I notice that the hardware tends to drop nonces before it starts to produce bad ones. As I push the chip harder and harder the "good" nonce rate drops to zero and bad nonces become frequent.

However, this ultimately is all a symptom of too low core voltage. 35GH/s gets pretty stable at around 1.050V. I had previously thought it stable at 0.975V but longer tests started producing more errors…

I've modified my supply to get higher output voltages but haven't gotten back to testing it yet.

You do need aggressive cooling at these voltages, so be careful! You can get away with short runs with minimal cooling, but even sitting idle at these voltages it's easy to generate enough heat to pop a chip (as I learned the other day when my code crashed in the debugger and I got distracted trying to figure out a bug I had been seeing from time to time).

Would isolating the SPI cables between the uC and the blades help? Shielding them like S/FTP LAN cable?

totalslacker

newbie

Activity: 26

Merit: 0

Quote from: zefir on March 26, 2014, 03:25:55 AM

Quote

But when we tried to push it past a 1050MHz clock (all the way to 1200MHz), cgminer seemed to show us wrong results. Cgminer showed a slightly lower hash rate than expected (Sys_clk * 32), but it kept going all the way up to 38GH/s per chip. HW errors were very low, lower than at the 32GH/s settings. We did not have any rejects or stales.

Hello,

a diverging hashrate at pool and cgminer simply means you are losing shares through HW errors.

What you need to consider is:

a) a detected HW error also implies that there were errors on true results; the related probability needs to be derived correctly, but I would assume that when you have a HW error rate of 5% it also means you are missing 5% of real results

I have noticed the same thing when pushing the hardware beyond 25GH/s. In my case I'm looping the test vector zefir had posted a while back. Since this has known nonces I can verify that the hardware is returning the correct nonce sequence. Irrespective of errors the time taken for each chip to finish a job always seems to correlate very closely to the configured hash rate.

I notice that the hardware tends to drop nonces before it starts to produce bad ones. As I push the chip harder and harder the "good" nonce rate drops to zero and bad nonces become frequent.
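A rough sketch of the kind of check I mean: compare what a chip returns for a known job against the expected nonce list and count good, bad and dropped nonces separately. The struct and function names are illustrative, the real nonce values from the test vector are not reproduced here, and it assumes the chip does not report the same nonce twice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct nonce_stats {
    size_t good;     /* returned nonces that are in the expected set */
    size_t bad;      /* returned nonces that are not expected */
    size_t dropped;  /* expected nonces that never came back */
};

/* Linear search; fine for the handful of nonces in one test job. */
static int contains(const uint32_t *set, size_t n, uint32_t value)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == value)
            return 1;
    return 0;
}

struct nonce_stats check_nonces(const uint32_t *expected, size_t n_expected,
                                const uint32_t *returned, size_t n_returned)
{
    struct nonce_stats s = {0, 0, 0};
    assert(expected != NULL || n_expected == 0);
    assert(returned != NULL || n_returned == 0);
    for (size_t i = 0; i < n_returned; i++) {
        if (contains(expected, n_expected, returned[i]))
            s.good++;
        else
            s.bad++;
    }
    for (size_t i = 0; i < n_expected; i++)
        if (!contains(returned, n_returned, expected[i]))
            s.dropped++;
    return s;
}
```

Tracking dropped and bad counts separately makes the pattern visible: the dropped count climbs first as the clock goes up, and the bad count follows later.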

However, this ultimately is all a symptom of too low core voltage. 35GH/s gets pretty stable at around 1.050V. I had previously thought it stable at 0.975V but longer tests started producing more errors…

I've modified my supply to get higher output voltages but haven't gotten back to testing it yet.

You do need aggressive cooling at these voltages, so be careful! You can get away with short runs with minimal cooling, but even sitting idle at these voltages it's easy to generate enough heat to pop a chip (as I learned the other day when my code crashed in the debugger and I got distracted trying to figure out a bug I had been seeing from time to time).

zefir

donator

Activity: 919

Merit: 1000

Info: Clarification on HW errors / hashrate

I got the below SW support request via PM which I think is relevant for other DIY projects and therefore want to respond here publicly.

Quote

Hello Zefir,

The chips are working nicely!

But when we tried to push it past a 1050MHz clock (all the way to 1200MHz), cgminer seemed to show us wrong results. Cgminer showed a slightly lower hash rate than expected (Sys_clk * 32), but it kept going all the way up to 38GH/s per chip. HW errors were very low, lower than at the 32GH/s settings. We did not have any rejects or stales.

I also checked the PLL setting trace and it corresponded to the datasheet (fbdiv = 71-78, pre- and postdivs were at 1, our ref clk is 16MHz).

Then we examined the pool's results; it was showing only 250GH/s - 330GH/s. When we switched back to a slower setting, the pool immediately showed higher hash rates.

Could you give us some advice on this, or point out what the possible reason could be? (We are using the latest cgminer bitmine-A1-driver fork.)

Thank you in advance!

Hello,

a diverging hashrate at pool and cgminer simply means you are losing shares through HW errors.

What you need to consider is:

a) a detected HW error also implies that there were errors on true results; the related probability needs to be derived correctly, but I would assume that when you have a HW error rate of 5% it also means you are missing 5% of real results

b) the A1 uses the real target, that is, if your pool sends you diff256 work, the A1 filters out any result with lower difficulty. In that case, generating a HW error is 256 times less probable (at least roughly; it needs a correct mathematical analysis), since you need to generate a wrong diff256 share; therefore you won't see many HW errors as difficulty increases. Equally, because of a), HW errors will cause the loss of wrongly calculated real shares.
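To put rough numbers on b), here is a small sketch of the expected result rate at a given difficulty. It uses the usual approximation of 2^32 hashes per diff-1 share; the function name is just for illustration.

```c
#include <assert.h>

/* Expected results per second that meet difficulty `diff`, using the
 * approximation of 2^32 hashes per diff-1 share. */
double expected_results_per_sec(double hashrate_hs, double diff)
{
    assert(hashrate_hs >= 0.0 && diff > 0.0);
    return hashrate_hs / (4294967296.0 * diff);
}
```

At 32GH/s that is roughly 7.5 results per second at diff 1 but only about 0.03 per second at diff 256, so there are 256 times fewer results on which a HW error could even be detected.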

The current cgminer driver for the A1 is meant for a field deployment where optimal hashrate was measured before and PLL is not tuned by users. If you need to have some meaningful feedback on HW errors to tune your system, you can achieve this easily by letting the A1 report Diff1 shares. For that, you basically need to prevent setting the real target for the jobs with this patch:

Code:

diff --git a/driver-SPI-bitmine-A1.c b/driver-SPI-bitmine-A1.c
index 81df48d..0104c34 100644
--- a/driver-SPI-bitmine-A1.c
+++ b/driver-SPI-bitmine-A1.c
@@ -652,7 +652,6 @@ static uint8_t *create_job(uint8_t chip_id, uint8_t job_id, struct work *work)
        p1[0] = bswap_32(p2[0]);
        p1[1] = bswap_32(p2[1]);
        p1[2] = bswap_32(p2[2]);
-       p1[4] = get_diff(work->sdiff);
        return job;
 }

Good Luck
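As a quick sanity check on the PLL settings in the quoted request, here is a small sketch. It assumes the simple relation f_out = f_ref * fbdiv / (prediv * postdiv); the exact divider encoding is in the datasheet, and the function names are illustrative. With both dividers at 1, as in the request, it reduces to f_ref * fbdiv.

```c
#include <assert.h>

/* PLL output in MHz, assuming f_out = f_ref * fbdiv / (prediv * postdiv).
 * This is a simplification; check the datasheet for the real encoding. */
unsigned a1_sys_clk_mhz(unsigned ref_mhz, unsigned fbdiv,
                        unsigned prediv, unsigned postdiv)
{
    assert(prediv > 0 && postdiv > 0);
    return ref_mhz * fbdiv / (prediv * postdiv);
}

/* Nominal hash rate in MH/s: Sys_clk * 32 cores. */
unsigned long a1_nominal_mhs(unsigned sys_clk_mhz)
{
    return (unsigned long)sys_clk_mhz * 32UL;
}
```

With a 16MHz reference, fbdiv = 71-78 corresponds to 1136-1248MHz, or about 36.4-39.9GH/s nominal per chip, which matches the 38GH/s that cgminer reported.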

tindela1

newbie

Activity: 5

Merit: 0

Quote from: goodney on March 25, 2014, 07:07:07 PM

Quote from: tindela1 on March 23, 2014, 06:06:13 PM

Just ran for short period of time. And here is snapshot from our build:

http://i.imgur.com/uDNUiKz.jpg

- Noncetech

Noncetech: how many A1's are under that heatsink/fan combo? Is the heatsink we see cooling the chips or the board? Do you have cooling on both sides? And finally, do you have current draw numbers?

Looks great though!

-a[g

The heatsink+fan combo seen in the picture is rated at about 0.22°C/W and cools the top side of the A1 chips. We have another, bigger 0.16°C/W heatsink+fan cooling the board. We measured some temperatures today at 800MHz: one board was quite stable at 65-67°C and the other at 55-58°C. Our FETs and inductors stabilized around 60°C.

We had a two-board stack-up configuration. I guess the other board is not getting enough fresh air. We may need to adjust the upper heatsink alignment so that it pushes the air out properly.

We have 8 chips under those heatsinks.

EDIT: We don't have proper equipment for accurate on-board current measurements at the moment, so we are unable to measure our buck converter's efficiency at different loads. But we had a wattage meter. I don't have numbers for the 25GH/s setting right now, but we were getting around 490-510W in total at ~490GH/s (16-chip configuration). The Raspberry and ATX PSU were drawing 25W, which needs to be subtracted from the total to get the board-specific figure. All the chips overclocked quite nicely. We still need to determine the optimum spot.
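A small sketch of that subtraction, just to show the rough efficiency these wall-meter numbers imply (the function name is illustrative):

```c
#include <assert.h>

/* Board-only efficiency in W per GH/s: wall power minus the overhead of
 * the controller and PSU, divided by the hash rate. */
double board_w_per_ghs(double wall_w, double overhead_w, double ghs)
{
    assert(ghs > 0.0 && wall_w >= overhead_w);
    return (wall_w - overhead_w) / ghs;
}
```

With 490-510W at the wall, 25W of overhead and ~490GH/s, this comes out at roughly 0.95-0.99 W per GH/s for the boards, DC/DC losses included.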

http://i.imgur.com/pUAjFZ8.jpg

- Noncetech

goodney

member

Activity: 102

Merit: 10

Quote from: tindela1 on March 23, 2014, 06:06:13 PM

Just ran for short period of time. And here is snapshot from our build:

- Noncetech

Noncetech: how many A1's are under that heatsink/fan combo? Is the heatsink we see cooling the chips or the board? Do you have cooling on both sides? And finally, do you have current draw numbers?

Looks great though!

-a[g

mhmmd

newbie

Activity: 5

Merit: 0

Hello everybody

I've found a mess in the BOM of the two-chip reference board!
So far the major issue is that there is a mix-up of two versions: Ver. 1.0.a and Ver. 1.0.b.
The first evident difference is that the latest release is missing C503 and C504 and therefore has a different design.
In the repository there is a mix of files from the two versions, and it is impossible, at least for me, to cross-check the diagram against the BOM; I was doing such a check and finding some inconsistencies when I discovered the problem.
If someone could help me set up a 100% error-free BOM, or at least provide me with a diagram of ver 1.0.b, I would appreciate it very much and will be happy to send him/her a free PCB (I have 100 of them waiting to be populated).

Thank you

tindela1

newbie

Activity: 5

Merit: 0

Here are our initial test results (2x4 IC configuration @ 0.85V):

http://i.imgur.com/HGszbm0.jpg

Just ran for short period of time. And here is snapshot from our build:

http://i.imgur.com/uDNUiKz.jpg

- Noncetech

Bicknellski

hero member

Activity: 924

Merit: 1000

The WPC EE says:

Quote

I am blinking the lights!

So:

1. 3.3V regulator working
2. JTAG interface to UC3C working
3. Processor is initializing
4. ASF now working (took moving to Studio 6.2, since all UC3C was broken in 6.1)
5. 12MHz oscillator is working
6. Can do one LED - all 3 colors are working

Yay! (Finally!)

silver71

member

Activity: 101

Merit: 10


Quote from: totalslacker on March 19, 2014, 04:18:45 PM

I added code to trim my supply and the good news is that it looks like the board is stable at 35GH/s at 0.975V. To get any faster than that I need to get over the max 1.050V my supply is able to put out (to do that I need to disassemble the cooling and change a sense resistor - not terrible but it is a hassle).

A single chip (the board has four) almost works at 1.050V/40GH/s. If I run all four then the supply under load drops to about 1.030V which doesn't work too well.

Need to validate this on more than one board of course

It's pointless to run any PSU at the edge of its power limit, since power flicker will reset your boards...

Use 2 or more... we have such a solution if you need one... let me know.

Bicknellski

hero member

Activity: 924

Merit: 1000

Interesting.

Yes, two-phase cooling like they have in the Asicminer mining center might be a great option for these, as you could potentially dump the heatsinks and fans.

https://bitcointalksearch.org/topic/visit-of-asicminers-immersion-cooling-mining-facility-346134

http://www.enterprisetech.com/2013/11/24/3m-allied-control-cool-clusters-novec-bubble-bath/

http://www.allied-control.com/
http://www.clusteredsystems.com/

totalslacker

newbie

Activity: 26

Merit: 0

I failed to measure current at 40GH - I can check that next time I try it.

35GH/s was [email protected] = 38W.

I did try a longer run under bfgminer and was seeing some hardware errors (about 0.5%). Not sure why my test wasn't catching them - I was just running zefir's test vector over and over (and validating that the correct nonces were returned). Guess I need more test vectors

The board did start getting pretty hot. I have a water block attached to the bottom of the board but just a heatsink on top of the chips. The heatsink was sitting at around 35C, but the top of the board itself was getting to over 60C. I suspect I need more vias to better transfer heat to the bottom of the board, or a heatsink that can cover the top of the board itself.

Clearly immersion is the way to go

Lucko

hero member

Activity: 826

Merit: 1000

How much power do you need for that (40GH/s)?
