
Topic: [CLOSED] Bitmine CoinCraft A1 28nm chip distribution / DIY support - page 4. (Read 81318 times)

legendary
Activity: 1029
Merit: 1000
I've had a similar problem. The problem was the power supply. When I sent write reg first and underclocked the chip to 200MHz, the BIST fix returned 31 cores; otherwise (BIST fix first, then write reg) it returned 1.
donator
Activity: 919
Merit: 1000
Quick question:

How many functional cores are you guys getting on your chips?
I've tested 4 tonight, and they all seem to only have 1 out of 32.
Are all sample chips like this? Or am I really, really unlucky?

The time to complete the test job is 5.2 sec @ 800MHz, which is consistent with
only 1 of 32 cores going. The chips aren't heating up much at all, either. Even
at 800MHz.

No, that is for sure not ok. As a rule of thumb, I would classify the chips as follows:
  • grade A: 31 - 32 working cores: run at full speed
  • grade B: 26 - 30 working cores: run at reduced speed (50-80% of nominal), bears low risk to disturb chain
  • grade C: 25 and less working cores: disable chip (or enable only after proper inspection), bears high risk to disturb chain

Alas, I have never seen a chip with only one working core so far. Are you sure your power supply is stable at ~850mV and you assert reset for ~1s before you issue the BIST command?
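
The rule of thumb above can be written down as a small helper. This is only a sketch of the grading thresholds given in this post; the function name `grade_chip` and the 32-core default are illustrative assumptions, not part of any A1 tooling.

```python
def grade_chip(working_cores: int, total_cores: int = 32) -> str:
    """Classify a chip by its BIST core count, per the rule of thumb above.

    A: 31-32 cores (run at full speed)
    B: 26-30 cores (run at 50-80% of nominal speed)
    C: 25 or fewer cores (disable, or enable only after inspection)
    """
    if not 0 <= working_cores <= total_cores:
        raise ValueError("working_cores must be between 0 and total_cores")
    if working_cores >= 31:
        return "A"
    if working_cores >= 26:
        return "B"
    return "C"


if __name__ == "__main__":
    for cores in (32, 28, 1):
        print(cores, "cores -> grade", grade_chip(cores))
```

A chip reporting a single working core lands in grade C, which is why the power supply and reset timing are worth checking before writing the chip off.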
newbie
Activity: 30
Merit: 0
Quick question:

How many functional cores are you guys getting on your chips?
I've tested 4 tonight, and they all seem to only have 1 out of 32.
Are all sample chips like this? Or am I really, really unlucky?

The time to complete the test job is 5.2 sec @ 800MHz, which is consistent with
only 1 of 32 cores going. The chips aren't heating up much at all, either. Even
at 800MHz.
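
As a rough sanity check on the timing argument, here is a small sketch. It assumes each core performs one hash per clock cycle and that the test job scans the full 32-bit nonce range; both are assumptions for illustration and are not confirmed in this post.

```python
NONCE_SPACE = 2 ** 32  # assumed: the test job covers the full 32-bit nonce range


def scan_time_seconds(clock_hz: float, active_cores: int,
                      nonce_space: int = NONCE_SPACE) -> float:
    """Estimated time to scan the nonce space, assuming one hash per core per clock."""
    if clock_hz <= 0 or active_cores <= 0:
        raise ValueError("clock_hz and active_cores must be positive")
    return nonce_space / (clock_hz * active_cores)


if __name__ == "__main__":
    print("1 core  @ 800MHz: %.2f s" % scan_time_seconds(800e6, 1))
    print("32 cores @ 800MHz: %.2f s" % scan_time_seconds(800e6, 32))
```

Under these assumptions a single core at 800MHz needs about 5.4 s, which is in the same ballpark as the measured 5.2 s, while 32 cores would finish in well under a second.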
newbie
Activity: 30
Merit: 0
Here are some more pics from mazurov and [gadget]'s build.

A few boards we've put together:

http://i.imgur.com/r1pUq5G.jpg

View of the DCDC area:

http://i.imgur.com/23Aiuzp.jpg

View of the A1 area:

http://i.imgur.com/GkYiUPf.jpg

Heatsink (let's see how far it takes us):

http://i.imgur.com/x7427xS.jpg

The one tool we couldn't have done without was the microscope:

http://i.imgur.com/AtWUq26.jpg

And for those who read this far, here is a small treat - a corrected BOM. I can verify that these parts will get you to a working board (at least during bring-up) :)

https://docs.google.com/spreadsheet/ccc?key=0AkO84VcUgOWgdFA5S0tGQTcxVVViX0I1VUlPaHhISEE&usp=sharing
newbie
Activity: 2
Merit: 0
Here is a TI TXB0106-based level translator I made to talk to A1s. It requires VCC from the MCU board for the high-voltage side to function and translate correctly. It also provides 1.8V for the A1s' data interface. The LDO on the left is a TI LP3871-1.8V. The circuit is trivial.

I need a second one and it takes too much time to build on a protoboard. I'm routing a PCB, will post when ready.

https://www.circuitsathome.com/wp/wp-content/uploads/2014/04/bc_levelshifter.jpg
newbie
Activity: 2
Merit: 0
Hello everybody

I've found a mess in the BOM of the two-chip reference board!
So far the major issue is that there is a mix-up of two versions: Ver. 1.0.a and Ver. 1.0.b.
The first evident difference is that the latest release is missing C503 and C504, and therefore has a different design.
The repository contains a mix of files from the two versions, and it is impossible, at least for me, to cross-check the diagram against the BOM; I was doing such a check and finding some inconsistencies when I discovered the problem.
If someone could help me set up a 100% error-free BOM, or at least provide me with a diagram of Ver. 1.0.b, I would appreciate it very much and would be happy to send him/her a free PCB (I have 100 of them waiting to be populated).

Thank you

The only difference is R12. Take a look at my site, the notes should still be on the front page.
newbie
Activity: 5
Merit: 0
I don't know if someone has already mentioned this, but we found out that engineering chips can be undervolted by first starting up at 0.85V and then, once the chips are hashing, adjusting the output voltage. This requires that your DC/DC feedback is designed to allow dynamic voltage adjustment without too much under- or overshoot.
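
One way to keep under- and overshoot small is to move the setpoint in small steps rather than in one jump. The sketch below only generates the list of setpoints; the 0.80V target and the 10mV step are made-up example values, and how a setpoint is actually applied depends entirely on your DC/DC design.

```python
import math


def voltage_steps(start_v: float, target_v: float, max_step_v: float = 0.01) -> list:
    """Return evenly spaced setpoints from start_v to target_v.

    No single step is larger than max_step_v. The start voltage itself is not
    included; the last element is the target. Values are rounded to 0.1mV.
    """
    if max_step_v <= 0:
        raise ValueError("max_step_v must be positive")
    delta = target_v - start_v
    # Small epsilon keeps float noise from adding a spurious extra step.
    n = math.ceil(abs(delta) / max_step_v - 1e-9)
    return [round(start_v + delta * i / n, 4) for i in range(1, n + 1)]


if __name__ == "__main__":
    # Hypothetical example: start at 0.85V, then ramp down once hashing.
    for v in voltage_steps(0.85, 0.80):
        print("set Vout to %.3f V, wait for it to settle, check HW errors" % v)
```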


- Noncetech
member
Activity: 101
Merit: 10

Quote
But when we tried to push it past 1050MHz clock (to all the way to 1200MHz) it seems that cgminer is showing us wrong results. Cgminer showed a bit smaller hashing speed than expected (Sys_clk * 32), but it kept on going all the way to 38GH/s per chip. HW errors were very small, smaller than 32GH/s settings. Did not have any rejections or stales.


Hello,

a diverging hashrate at pool and cgminer simply means you are losing shares through HW errors.

What you need to consider is:

a) a detected HW error also implies that there were errors on true results; the related probability needs to be derived correctly, but I would assume that when you have a HW error rate of 5% it also means you are missing 5% of real results


I have noticed the same thing when pushing the hardware beyond 25GH/s. In my case I'm looping the test vector zefir had posted a while back. Since this has known nonces I can verify that the hardware is returning the correct nonce sequence. Irrespective of errors the time taken for each chip to finish a job always seems to correlate very closely to the configured hash rate.

I notice that the hardware tends to drop nonces before it starts to produce bad ones. As I push the chip harder and harder the "good" nonce rate drops to zero and bad nonces become frequent.

However, this ultimately is all a symptom of too low core voltage. 35GH/s gets pretty stable at around 1.050V. I had previously thought it stable at 0.975V but longer tests started producing more errors…

I've modified my supply to get higher output voltages but haven't gotten back to testing it yet.

You do need aggressive cooling at these voltages so be careful! You can get away with short runs with minimal cooling but be careful. Even sitting idle at these voltages it's easy to generate enough heat to pop a chip (as I learned the other day when my code crashed in the debugger and I got distracted trying to figure out a bug I had been seeing from time to time).

Would isolation of SPI cables uC<>blades help ? Shielding them like S/FTP LAN ?
newbie
Activity: 26
Merit: 0

Quote
But when we tried to push it past 1050MHz clock (to all the way to 1200MHz) it seems that cgminer is showing us wrong results. Cgminer showed a bit smaller hashing speed than expected (Sys_clk * 32), but it kept on going all the way to 38GH/s per chip. HW errors were very small, smaller than 32GH/s settings. Did not have any rejections or stales.


Hello,

a diverging hashrate at pool and cgminer simply means you are losing shares through HW errors.

What you need to consider is:

a) a detected HW error also implies that there were errors on true results; the related probability needs to be derived correctly, but I would assume that when you have a HW error rate of 5% it also means you are missing 5% of real results


I have noticed the same thing when pushing the hardware beyond 25GH/s. In my case I'm looping the test vector zefir had posted a while back. Since this has known nonces I can verify that the hardware is returning the correct nonce sequence. Irrespective of errors the time taken for each chip to finish a job always seems to correlate very closely to the configured hash rate.

I notice that the hardware tends to drop nonces before it starts to produce bad ones. As I push the chip harder and harder the "good" nonce rate drops to zero and bad nonces become frequent.

However, this ultimately is all a symptom of too low core voltage. 35GH/s gets pretty stable at around 1.050V. I had previously thought it stable at 0.975V but longer tests started producing more errors…

I've modified my supply to get higher output voltages but haven't gotten back to testing it yet.

You do need aggressive cooling at these voltages so be careful! You can get away with short runs with minimal cooling but be careful. Even sitting idle at these voltages it's easy to generate enough heat to pop a chip (as I learned the other day when my code crashed in the debugger and I got distracted trying to figure out a bug I had been seeing from time to time).
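
For anyone who wants to run the same kind of check, here is a minimal sketch of comparing returned nonces against a known test vector. The function name and the example nonce values are hypothetical; it simply separates good, dropped and bad nonces as described above.

```python
def classify_nonces(expected, returned):
    """Compare nonces returned by the hardware against the known-good set.

    good:    expected nonces that were returned
    dropped: expected nonces that never showed up
    bad:     returned nonces that are not in the expected set
    """
    expected_set = set(expected)
    returned_set = set(returned)
    return {
        "good": sorted(expected_set & returned_set),
        "dropped": sorted(expected_set - returned_set),
        "bad": sorted(returned_set - expected_set),
    }


if __name__ == "__main__":
    # Made-up nonce values, for illustration only.
    known = [0x0000A1B2, 0x10203040, 0x7FFF0001]
    seen = [0x10203040, 0xDEADBEEF]
    result = classify_nonces(known, seen)
    for kind in ("good", "dropped", "bad"):
        print(kind, [hex(n) for n in result[kind]])
```

Tracking the dropped count separately from the bad count makes it easier to see the point where the chip starts losing nonces before it starts returning wrong ones.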
donator
Activity: 919
Merit: 1000
Info: Clarification on HW errors / hashrate

I got the below SW support request via PM which I think is relevant for other DIY projects and therefore want to respond here publicly.

Quote
Hello Zefir,

The chips are working nicely!

But when we tried to push it past 1050MHz clock (to all the way to 1200MHz) it seems that cgminer is showing us wrong results. Cgminer showed a bit smaller hashing speed than expected (Sys_clk * 32), but it kept on going all the way to 38GH/s per chip. HW errors were very small, smaller than 32GH/s settings. Did not have any rejections or stales.

I also checked the PLL setting trace and it corresponded to the datasheet (fbdiv = 71-78, pre- and postdivs were at 1, our ref clk is 16MHz).

Then we examined the pool's results; it was showing more like 250GH/s - 330GH/s. When we switched back to a slower setting, the pools immediately showed higher hashing speeds.

Could you give us some advice on this, or point out what the possible reason could be? (We are using the latest cgminer bitmine-A1-driver fork).


Thank you in advance!
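
For reference, the nominal numbers in this request can be reproduced with a few lines. The hashrate formula (Sys_clk * 32) is the one quoted above; the PLL relation used here is the generic ref * fbdiv / (prediv * postdiv) form and is an assumption, since the A1 datasheet's exact divider encoding is not given in this thread.

```python
CORES_PER_CHIP = 32


def pll_clock_hz(ref_clk_hz: float, fbdiv: int, prediv: int = 1, postdiv: int = 1) -> float:
    """Assumed generic PLL relation; the real A1 register encoding may differ."""
    return ref_clk_hz * fbdiv / (prediv * postdiv)


def nominal_hashrate_hs(sys_clk_hz: float, cores: int = CORES_PER_CHIP) -> float:
    """Nominal per-chip hashrate as quoted in the request: Sys_clk * 32."""
    return sys_clk_hz * cores


if __name__ == "__main__":
    for fbdiv in (71, 75, 78):
        clk = pll_clock_hz(16e6, fbdiv)
        print("fbdiv=%d -> %.0f MHz -> %.1f GH/s nominal"
              % (fbdiv, clk / 1e6, nominal_hashrate_hs(clk) / 1e9))
```

Under that assumption, fbdiv 71-78 with a 16MHz reference spans roughly 36 to 40 GH/s nominal per chip, which matches the ~38GH/s cgminer was reporting.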

Hello,

a diverging hashrate at pool and cgminer simply means you are losing shares through HW errors.

What you need to consider is:

a) a detected HW error also implies that there were errors on true results; the related probability needs to be derived correctly, but I would assume that when you have a HW error rate of 5% it also means you are missing 5% of real results

b) the A1 uses the real target, that is, if your pool sends you diff256 work, the A1 filters out any result with lower difficulty. In that case, generating a HW error is (at least roughly; this needs correct mathematical analysis) 256 times less probable (since you need to generate a wrong diff256 share), so you won't see many HW errors as difficulty increases. Equivalently, because of a), HW errors will cause the loss of wrongly calculated real shares.
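
Points a) and b) can be put into numbers with a very rough model. This is a sketch of the simplified reasoning above, not a proper statistical analysis; the function names and example figures are illustrative only.

```python
def effective_hashrate(nominal: float, hw_error_rate: float) -> float:
    """Point a): assume the fraction of lost real results equals the HW error rate."""
    if not 0.0 <= hw_error_rate <= 1.0:
        raise ValueError("hw_error_rate must be between 0 and 1")
    return nominal * (1.0 - hw_error_rate)


def hw_errors_seen(diff1_hw_errors: float, target_diff: float) -> float:
    """Point b): with the real target set, only about 1/target_diff of the
    errors that would be visible at diff1 get reported at all."""
    if target_diff < 1:
        raise ValueError("target_diff must be >= 1")
    return diff1_hw_errors / target_diff


if __name__ == "__main__":
    print("32 GH/s nominal at 5%% HW errors -> %.1f GH/s" % effective_hashrate(32.0, 0.05))
    print("256 diff1 HW errors show up as ~%.0f at diff256" % hw_errors_seen(256, 256))
```

The second function is the reason a low HW error count at high pool difficulty says little about how many results the chips are actually losing.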


The current cgminer driver for the A1 is meant for a field deployment where optimal hashrate was measured before and PLL is not tuned by users. If you need to have some meaningful feedback on HW errors to tune your system, you can achieve this easily by letting the A1 report Diff1 shares. For that, you basically need to prevent setting the real target for the jobs with this patch:
Code:
diff --git a/driver-SPI-bitmine-A1.c b/driver-SPI-bitmine-A1.c
index 81df48d..0104c34 100644
--- a/driver-SPI-bitmine-A1.c
+++ b/driver-SPI-bitmine-A1.c
@@ -652,7 +652,6 @@ static uint8_t *create_job(uint8_t chip_id, uint8_t job_id, struct work *work)
        p1[0] = bswap_32(p2[0]);
        p1[1] = bswap_32(p2[1]);
        p1[2] = bswap_32(p2[2]);
-       p1[4] = get_diff(work->sdiff);
        return job;
 }



Good Luck
newbie
Activity: 5
Merit: 0

Just ran it for a short period of time. Here is a snapshot from our build:

http://i.imgur.com/uDNUiKz.jpg


- Noncetech

Noncetech: how many A1's are under that heatsink/fan combo? Is the heatsink we see cooling the chips or the board? Do you have cooling on both sides? And finally, do you have current draw numbers?

Looks great though!

-a[g


The heatsink+fan combo seen in the picture is rated at about 0.22W/Cdeg and cools the top side of the A1 chips. We have another, bigger 0.16W/Cdeg heatsink+fan cooling the board. We measured some temperatures today at 800MHz: one board was quite stable at 65-67Cdeg and the other at 55-58Cdeg. Our FETs and inductors stabilized around 60Cdeg.

We had a two-board stack-up configuration. I guess the other board is not getting enough fresh air. We may need to adjust the upper heatsink alignment so that it's pushing the air out properly.

We have 8 chips under those heatsinks.

EDIT: We don't have proper equipment for accurate on-board current measurements at the moment, so we are unable to measure our buck's efficiency at different loads. But we had a wattage meter. I don't have numbers for the 25GH/s setting right now, but we were getting around 490-510W in total at ~490GH/s (16-chip configuration). The Raspberry and ATX PSU were drawing 25W, so that needs to be subtracted from the total to get the board-specific figure. All the chips overclocked quite nicely. We still need to determine the optimum spot.
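
To turn those wall-meter readings into a rough efficiency figure, the arithmetic can be sketched like this. It only uses the numbers from this post (490-510W total, 25W overhead, ~490GH/s); the function name is illustrative, and the result still includes buck converter losses.

```python
def watts_per_ghs(total_w: float, overhead_w: float, hashrate_ghs: float) -> float:
    """Board-level power per GH/s after subtracting the fixed overhead."""
    if hashrate_ghs <= 0:
        raise ValueError("hashrate_ghs must be positive")
    return (total_w - overhead_w) / hashrate_ghs


if __name__ == "__main__":
    low = watts_per_ghs(490, 25, 490)
    high = watts_per_ghs(510, 25, 490)
    print("approx. %.2f - %.2f W per GH/s at the boards" % (low, high))
```

That works out to a little under 1 W per GH/s at the overclocked setting.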

http://i.imgur.com/pUAjFZ8.jpg

- Noncetech
member
Activity: 102
Merit: 10

Just ran it for a short period of time. Here is a snapshot from our build:




- Noncetech

Noncetech: how many A1's are under that heatsink/fan combo? Is the heatsink we see cooling the chips or the board? Do you have cooling on both sides? And finally, do you have current draw numbers?

Looks great though!

-a[g
newbie
Activity: 5
Merit: 0
Hello everybody

I've found a mess in the BOM of the two-chip reference board!
So far the major issue is that there is a mix-up of two versions: Ver. 1.0.a and Ver. 1.0.b.
The first evident difference is that the latest release is missing C503 and C504, and therefore has a different design.
The repository contains a mix of files from the two versions, and it is impossible, at least for me, to cross-check the diagram against the BOM; I was doing such a check and finding some inconsistencies when I discovered the problem.
If someone could help me set up a 100% error-free BOM, or at least provide me with a diagram of Ver. 1.0.b, I would appreciate it very much and would be happy to send him/her a free PCB (I have 100 of them waiting to be populated).

Thank you
newbie
Activity: 5
Merit: 0
Here are our initial test results (2x4 IC configuration @ 0.85V):

http://i.imgur.com/HGszbm0.jpg

Just ran it for a short period of time. Here is a snapshot from our build:

http://i.imgur.com/uDNUiKz.jpg


- Noncetech
hero member
Activity: 924
Merit: 1000
The WPC EE says:

Quote
I am blinking the lights!

So:

1. 3.3V regulator working
2. JTAG interface to UC3C working
3. Processor is initializing
4. ASF now working (took moving to Studio 6.2, since all UC3C was broken in 6.1)
5. 12MHz oscillator is working
6. Can do one LED - all 3 colors are working

Yay! (Finally!)

member
Activity: 101
Merit: 10
I added code to trim my supply and the good news is that it looks like the board is stable at 35GH/s at 0.975V. To get any faster than that I need to get over the max 1.050V my supply is able to put out (to do that I need to disassemble the cooling and change a sense resistor - not terrible but it is a hassle).

A single chip (the board has four) almost works at 1.050V/40GH/s. If I run all four then the supply under load drops to about 1.030V which doesn't work too well.

Need to validate this on more than one board of course :)

It's useless to operate any PSU at the edge of its power limit, since power flicker will reset your boards...

Use 2 or more... we have such a solution if you have no clue... let me know.
newbie
Activity: 26
Merit: 0
I failed to measure current at 40GH - I can check that next time I try it.

35GH/s was [email protected] = 38W.

I did try a longer run under bfgminer and was seeing some hardware errors (about 0.5%). Not sure why my test wasn't catching them - I was just running zefir's test vector over and over (and validating that the correct nonces were returned). Guess I need more test vectors :)

The board did start getting pretty hot. I have a water block attached to the bottom of the board but just a heatsink on top of the chips. The heatsink was sitting at around 35C but the board (the top) itself was getting to over 60C. I suspect I need more vias to better transfer overall heat to the bottom of the board. Or figure out a heatsink that can cover the board top itself.

Clearly immersion is the way to go :)
hero member
Activity: 826
Merit: 1000
How much power do you need for that (40GH/s)?