Topic: Klondike - 16 chip ASIC Open Source Board - Preliminary - page 82. (Read 435386 times)

sr. member
Activity: 296
Merit: 250
So the pool protocol causes high HW errors?
Makes no sense, I know. And I'm not saying it does, but when I switched to stratum the rates dropped right down. Still scratching my head. I'm just letting both Erupter and Klondike run now. Klondike currently has A:99 R:0 HW:2 - which is the best it's been yet, though not as good as the Erupter at A:293 R:0 HW:3.

With my USB Asicminer Erupters I see higher HW errors when the pool misbehaves (goes offline) or when my internet connection misbehaves (goes down). So color me unsurprised. But it is still nonsense.
hero member
Activity: 658
Merit: 500
CCNA: There i fixed the internet.
I don't remember if anyone has asked this before. I've been silently watching in the background...

Anyway, for the PIC firmware, I remember you stating that it subdivides the nonce range by n chips and pushes those ranges to the chips.
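For readers following along, here is a minimal sketch of what splitting the nonce range across n chips could look like. It is not taken from the Klondike firmware; the function name `split_nonce_range` and the choice of contiguous, inclusive ranges are assumptions made purely for illustration.

```python
NONCE_SPACE = 1 << 32  # a Bitcoin block header nonce is 32 bits wide


def split_nonce_range(n_chips):
    """Return one inclusive (start, end) nonce range per chip.

    Hypothetical helper: the real firmware may divide the space differently.
    """
    if n_chips < 1:
        raise ValueError("need at least one chip")
    base, extra = divmod(NONCE_SPACE, n_chips)
    ranges = []
    start = 0
    for chip in range(n_chips):
        # Spread any remainder over the first few chips so nothing is skipped.
        size = base + (1 if chip < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


if __name__ == "__main__":
    for chip, (lo, hi) in enumerate(split_nonce_range(4)):
        print(f"chip {chip}: 0x{lo:08x} - 0x{hi:08x}")
```

With this scheme every chip works on the same job but a different slice of the nonce space. A "1 job per chip" design would instead hand each chip the full range for a different job.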

How difficult/possible would it be to rework the FW to do 1 job per chip?

This is just out of curiosity, since I put in an order for 5 chips in a group buy + board (once finalized [TY, T13Hydra]).

-Taugeran
newbie
Activity: 36
Merit: 0
Concerning the 128 MHz vs 150 MHz issue, maybe the internal PLL has stability problems at different frequencies.
I hadn't thought about that, but it's possible and perhaps it's tuned for higher frequencies than I'm currently using. We'll see pretty soon. After I get a few more chips mounted I'll add a heat sink and bump up the clock. I think my plan is to add one more on the same bank, and then after that two more on the opposite bank.



Isn't the GetWork protocol deprecated anyway? Not that it shouldn't work, but I thought stratum was the preferred protocol.
I haven't been following that but I'm sure stratum is preferred. And if it works that much better, for whatever reasons, then I'm not going to worry much about getwork.

****
I pushed new updates to github earlier with some small tweaks.

The firmware now takes clock cfg values from 256 up to 900. They are double the MHz rate, so that's 128 - 450 MHz (not that you can run at 450, but the PLL on the ASIC accepts values that high). The code now detects values below 500 and sets the half-clock bit for them. It also excludes 451-499 (i.e. 225-249 MHz) by forcing them to 450, since the PLL doesn't support that range.


In my RL job, I previously worked on a project where a PLL was throwing our whole system out of whack. The problem was that it would lock about 50 percent of the time, so we would get intermittent valid data with occasional garbage. After thoroughly tracing out various components we observed that there was an unusual amount of noise getting into the PLL, thereby causing it to lose its lock occasionally. This was compounded by there being varying degrees of noise at various frequencies. Once we filtered these out we were able to maintain a continuous lock and produce clean data.

PLL might be a good place to start looking. Just make sure your PLL maintains a good lock.
newbie
Activity: 40
Merit: 0
Concerning the 128 MHz vs 150 MHz issue, maybe the internal PLL has stability problems at different frequencies.
I hadn't thought about that, but it's possible and perhaps it's tuned for higher frequencies than I'm currently using. We'll see pretty soon. After I get a few more chips mounted I'll add a heat sink and bump up the clock. I think my plan is to add one more on the same bank, and then after that two more on the opposite bank.



Isn't the GetWork protocol deprecated anyway? Not that it shouldn't work, but I thought stratum was the preferred protocol.
I haven't been following that but I'm sure stratum is preferred. And if it works that much better, for whatever reasons, then I'm not going to worry much about getwork.

****
I pushed new updates to github earlier with some small tweaks.

The firmware now takes clock cfg values from 256 up to 900. They are double the MHz rate, so that's 128 - 450 MHz (not that you can run at 450, but the PLL on the ASIC accepts values that high). The code now detects values below 500 and sets the half-clock bit for them. It also excludes 451-499 (i.e. 225-249 MHz) by forcing them to 450, since the PLL doesn't support that range.


Liquid cooling... I wanna see 450.

Dunk it in mineral oil :P
hero member
Activity: 924
Merit: 1000
Concerning the 128 MHz vs 150 MHz issue, maybe the internal PLL has stability problems at different frequencies.
I hadn't thought about that, but it's possible and perhaps it's tuned for higher frequencies than I'm currently using. We'll see pretty soon. After I get a few more chips mounted I'll add a heat sink and bump up the clock. I think my plan is to add one more on the same bank, and then after that two more on the opposite bank.



Isn't the GetWork protocol deprecated anyway? Not that it shouldn't work, but I thought stratum was the preferred protocol.
I haven't been following that but I'm sure stratum is preferred. And if it works that much better, for whatever reasons, then I'm not going to worry much about getwork.

****
I pushed new updates to github earlier with some small tweaks.

The firmware now takes clock cfg values from 256 up to 900. They are double the MHz rate, so that's 128 - 450 MHz (not that you can run at 450, but the PLL on the ASIC accepts values that high). The code now detects values below 500 and sets the half-clock bit for them. It also excludes 451-499 (i.e. 225-249 MHz) by forcing them to 450, since the PLL doesn't support that range.


Liquid cooling... I wanna see 450.
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
I hadn't thought about that, but it's possible and perhaps it's tuned for higher frequencies than I'm currently using. We'll see pretty soon. After I get a few more chips mounted I'll add a heat sink and bump up the clock. I think my plan is to add one more on the same bank, and then after that two more on the opposite bank.

About that heatsink. Isn't it true that Avalon chips must be cooled from below? I mean you cannot put a heatsink on top of the chip, only below the PCB with a silicone thermal pad. It's just like how the Block Erupter is cooled.
Yes, that's right. The heat sink is mounted under the board. There are 1cm x 1cm exposed pads with thermal vias to help dissipation to the heat sink. Or, rather, the chips are mounted on bottom and heat sink on top - so the board is upside down...
newbie
Activity: 18
Merit: 0
I hadn't thought about that, but it's possible and perhaps it's tuned for higher frequencies than I'm currently using. We'll see pretty soon. After I get a few more chips mounted I'll add a heat sink and bump up the clock. I think my plan is to add one more on the same bank, and then after that two more on the opposite bank.

About that heatsink. Isn't it true that Avalon chips must be cooled from below? I mean you cannot put a heatsink on top of the chip, only below the PCB with a silicone thermal pad. It's just like how the Block Erupter is cooled.
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
What's the input clock for the avalon running at?
32 MHz

There are 2 PLL control values, R and N. By setting R=32 you get N = 2x MHz rate, which is what I expose as the clk cfg value. Documented range is 500 - 900. But a "half rate" bit allows dividing that by 2. So for N < 500 I set that bit and use 2N for the control value. I don't allow a cfg value below 256 even though the PLL allows down to 250.
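The mapping described above can be sketched roughly as follows. This is an illustration based only on the description in this thread, not the actual PIC firmware: the function name `pll_settings`, the returned tuple layout, and raising an error for out-of-range values are all assumptions.

```python
CFG_MIN = 256  # lowest clock cfg value the firmware accepts (128 MHz)
CFG_MAX = 900  # highest value the PLL accepts (450 MHz)


def pll_settings(cfg):
    """Map a clock cfg value (2x the MHz rate) to (n_value, half_rate, mhz).

    Hypothetical helper. Assumes R is fixed at 32 with a 32 MHz input clock,
    as stated in the post, so only N and the half-rate bit vary.
    """
    # How the real firmware rejects bad values is not stated; raising is an assumption.
    if not CFG_MIN <= cfg <= CFG_MAX:
        raise ValueError(f"clock cfg must be {CFG_MIN}..{CFG_MAX}, got {cfg}")
    # 451-499 (225-249 MHz) is not reachable, so force it down to 450.
    if 451 <= cfg <= 499:
        cfg = 450
    if cfg < 500:
        # Below the documented N range: set the half-rate bit and program 2N.
        return (2 * cfg, True, cfg / 2)
    return (cfg, False, cfg / 2)


if __name__ == "__main__":
    for cfg in (256, 300, 475, 500, 900):
        n_value, half_rate, mhz = pll_settings(cfg)
        print(f"cfg={cfg}: N={n_value} half_rate={half_rate} -> {mhz} MHz")
```

Note that every N value this produces falls inside the documented 500 - 900 range, which is the point of the half-rate bit.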
cp1
hero member
Activity: 616
Merit: 500
Stop using branwallets
What's the input clock for the avalon running at?
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
Concerning the 128 MHz vs 150 MHz issue, maybe the internal PLL has stability problems at different frequencies.
I hadn't thought about that, but it's possible and perhaps it's tuned for higher frequencies than I'm currently using. We'll see pretty soon. After I get a few more chips mounted I'll add a heat sink and bump up the clock. I think my plan is to add one more on the same bank, and then after that two more on the opposite bank.



Isn't the GetWork protocol deprecated anyway? Not that it shouldn't work, but I thought stratum was the preferred protocol.
I haven't been following that but I'm sure stratum is preferred. And if it works that much better, for whatever reasons, then I'm not going to worry much about getwork.

****
I pushed new updates to github earlier with some small tweaks.

The firmware now takes clock cfg values from 256 up to 900. They are double the MHz rate, so that's 128 - 450 MHz (not that you can run at 450, but the PLL on the ASIC accepts values that high). The code now detects values below 500 and sets the half-clock bit for them. It also excludes 451-499 (i.e. 225-249 MHz) by forcing them to 450, since the PLL doesn't support that range.
sr. member
Activity: 294
Merit: 250

Isn't the GetWork protocol deprecated anyway? Not that it shouldn't work, but I thought stratum was the preferred protocol.
sr. member
Activity: 378
Merit: 250
Concerning the 128 MHz vs 150 MHz issue, maybe the internal PLL has stability problems at different frequencies.
sr. member
Activity: 476
Merit: 250
So the pool protocol causes high HW errors?
Makes no sense, I know. And I'm not saying it does, but when I switched to stratum the rates dropped right down. Still scratching my head. I'm just letting both Erupter and Klondike run now. Klondike currently has A:99 R:0 HW:2 - which is the best it's been yet, though not as good as the Erupter at A:293 R:0 HW:3.
Well, it makes sense and it does not. You have designed the K16 from scratch, you don't have the ASIC comm protocol source, and you don't know the Erupter comm protocol either. So something there is causing the problem. Maybe Avalon will release the comm protocol soon so we can be sure.
legendary
Activity: 1190
Merit: 1000
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
So the pool protocol causes high HW errors?
Makes no sense, I know. And I'm not saying it does, but when I switched to stratum the rates dropped right down. Still scratching my head. I'm just letting both Erupter and Klondike run now. Klondike currently has A:99 R:0 HW:2 - which is the best it's been yet, though not as good as the Erupter at A:293 R:0 HW:3.
sr. member
Activity: 476
Merit: 250
So the pool protocol causes high HW errors?
hero member
Activity: 924
Merit: 1000
That will do BKK. That will do.


sr. member
Activity: 294
Merit: 250
Sweet!!! Now things are getting really interesting. :)

Thanks again for the hard work and determination!
legendary
Activity: 4634
Merit: 1851
Linux since 1997 RedHat 4
A few things:
1) The HW: is reported in 1diff, but 3.3.1 (and earlier) report A: and R: in shares (which can be any diff - depends on what you are talking to)
Current git reports them in 1diff (i.e. the next cgminer version will only be 1diff for all of HW, A and R) - we changed that a few days ago in git.
In API devs I report both.

2) In current git I have also implemented what we call cps - on Icarus and ModMinerQuad. AMU (asic miner USB) is Icarus
In my mining, the AMU at 335 MH/s gets around 1% errors (certainly less than 1.5%)
Without cps you would expect more errors

3) In my API stats I've added 2 new fields: "USB Pipe" and "USB Delay"
If "USB Pipe" is non-zero then there are USB problems happening that could also be causing errors.
"USB Delay" shows if there are timing 'issues' occurring in the code (cps fixes these and reports them in "USB Delay")
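Because HW: is counted in 1diff while A: and R: in 3.3.1 are counted in shares, the counters need to be put on the same footing before comparing them. A rough sketch of one way to do that is below. The function name `hw_error_rate`, the definition of the rate as HW / (A + R + HW), and the assumption that every share has the same fixed difficulty are all illustrative choices, not something cgminer itself defines.

```python
def hw_error_rate(accepted, rejected, hw_errors, share_diff=1.0):
    """Estimate the hardware error rate in 1diff units.

    Hypothetical helper. `accepted` and `rejected` are share counts,
    `hw_errors` is already in 1diff, and `share_diff` is an assumed fixed
    difficulty per share (1.0 means the shares are already 1diff).
    """
    accepted_1diff = accepted * share_diff
    rejected_1diff = rejected * share_diff
    total = accepted_1diff + rejected_1diff + hw_errors
    if total == 0:
        return 0.0
    return hw_errors / total


if __name__ == "__main__":
    # Counts quoted earlier in the thread, assuming difficulty-1 shares.
    print(f"Klondike: {hw_error_rate(99, 0, 2):.2%}")
    print(f"Erupter:  {hw_error_rate(293, 0, 3):.2%}")
```

If the pool hands out higher-difficulty shares, passing the real `share_diff` keeps the A: and R: counts from being under-weighted relative to HW:.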
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
Today's Update.

I spent all day testing and trying to find what is causing HW errors.
I also did some comparison/companion testing with the Erupter that a very generous turtle83 sent me. The Klondike and Erupter ran fine together, and the cgminer menu items seem to be fine now too, after updating to 3.3.1.

I spent a lot of time analysing share.logs and running the data through my kslog util to generate work data for ktest. What I found out was that almost all the HW errors are non-repeatable. If I take accepted data and feed it back in manually I get the same nonce out. When I feed similar data that resulted in error nonces I usually get NO nonce out at all. This seems to indicate some problem with midstate/precalc/data not getting into the ASIC correctly rather than errors caused by bad result capture. Now I checked my code several times trying to find anywhere the data gets corrupted before pushing to the ASIC and can't see it.

As the day progressed I found the error rate dropping off as well. After a run of 1.5 hours along with the Erupter I found that the Klondike had about a 3% error rate, and the Erupter about 1.5%. But I'd been getting a lot of Rejected shares and I wondered if that was due to the slow speed and delays in submitting shares or what. So this evening I switched from 50btc (getwork) to BTCGuild (stratum) and saw that Rejects dropped a lot, and so far HW Errors are completely gone to 0 (knock on wood).

So it could even be that some problem with generating work with GetWork is sending bad data to the Klondike (?? weird), as with stratum (local block generation) I have not been getting HW errors. I'm trying to understand how that can be. I never see USB disconnects at all now. And if HW errors drop right off with stratum, then I'll probably add another ASIC and start checking the chaining next. Right now it's running at a 150 MHz clock with no heat sink, and it's a bit hottish, but touchable with fingers for about 5 seconds.

Or maybe error rates actually get lower as the clock rate rises because going from 128 to 150 seems to have lowered the HW errors. Hmmm. Figure that out.

Plan for tomorrow: solder down more chips.
