Topic: Klondike - 16 chip ASIC Open Source Board - Preliminary - page 70. (Read 435385 times)

full member
Activity: 176
Merit: 100

I'm exploring the power supply here.
I have a 100kHz 0.5V amplitude signal in 5ms bursts every 10ms on my 1.2V core power. It's showing on 12V in from PSU, and on 3.3V. Does anyone have experience with what can cause that kind of pulsation? When looking in closely at the 100kHz it's quite substantial spikes with damped oscillations of approx. 100 MHz, 400mV. That's pretty big on a 1.2V supply.


I guess you would have put the ferrite in somewhere if you could?

Can you isolate whether the noise is in the ground or the supply?

Can you use a bench supply for the 1.2V by removing the regulator?
full member
Activity: 176
Merit: 100
@All: are there any options for watercooling?
User toolhead created a very nice solution for Burnin's board, and Burnin will actually (optionally) assemble his boards directly with toolhead's watercooling block.

So it would be very nice if I could use my pump and radiator for all my boards, including my TH's K64.

There is a separate thread on cooling, https://bitcointalksearch.org/topic/klondike-heatsink-sourcing-208381.

This thread is for the electronics, firmware and drivers.  BKK isn't delivering a product, he's delivering a design - that's a different point of view from Burnin's.
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
That's good then.
In the first it's 0.983 + 1.048 = 2.031 seconds
In the second it's just 1.016 seconds.
You have to ignore the first time interval because that's the time from cmd to nonce, which indicates how far into the range the nonce is rather than the full range time.
Add up the times with the same WorkID (but don't use the first work unit, since its first interval is that cmd-to-nonce time).

2^32 / 16 / 1.016 seconds = 264208125 Hz ~ 264 MHz (but probably 260 because I always cut the work short rather than long). Cutting short has no downside but running long gives duplicates.
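The arithmetic above can be sketched in C as follows. This is only an illustration, not code from ktest: the struct and both function names are made up here, and it assumes 16 chips splitting the 2^32 nonce space evenly.

```c
#include <stddef.h>

/* One "Nonce Found" line from ktest: the WorkID and the seconds
   printed in parentheses. (Hypothetical struct, not from ktest.) */
struct nonce_event {
    unsigned work_id;
    double   secs;
};

/* Sum the intervals reported for one WorkID. */
double work_time(const struct nonce_event *ev, size_t n, unsigned work_id)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        if (ev[i].work_id == work_id)
            total += ev[i].secs;
    return total;
}

/* Effective clock in Hz, assuming `chips` chips split the 2^32 nonce
   space evenly and one full pass over the range took `secs` seconds. */
double effective_clock_hz(double secs, unsigned chips)
{
    return 4294967296.0 / (double)chips / secs;
}
```

Feeding it the WorkID 02 lines from the two logs gives 2.031 s for the first run and 1.016 s for the second; the second works out to roughly 264 MHz for 16 chips.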
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
I have also noticed that a lot of times there is no activity in the log except for the regular USB scanning. This period of lull always follows "Pushing work from pool 0 to hash queue". I am assuming this is because sometimes the nonce never comes back, and you probably have a certain wait period in cgminer, after which it re-sends the work data to the board.
No. It sends the work to the board only once. Some work has no nonces: about 37% according to kano's stats. Another 37% has 1 nonce, and if I recall 17% has 2, with 3, 4 etc. less often. Back up this thread a ways for actual numbers.

Strangely I think the BFL source code only submits 1 nonce/work because it calls work_completed after the first nonce which removes the work from the queue such that further nonces won't get found. I could be missing something there but that's how I read it.

edit: Kano's numbers were,

1000 results:
374/364/175/64/19/1/2/1/0/0

(that's 0/1/2/3/4/5/6/7/8/9 nonces per work unit)
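Those counts look close to a Poisson distribution with a mean of 1 nonce per work unit. That model is my assumption, not something stated in kano's post; it follows if each of the 2^32 hashes in a work unit has about a 1 in 2^32 chance of being a difficulty-1 nonce. A small C sketch with hypothetical function names, written without libm so it builds without -lm:

```c
/* e^x by Taylor series; accurate enough for the small x used here. */
static double exp_series(double x)
{
    double term = 1.0, sum = 1.0;
    for (int i = 1; i < 30; i++) {
        term *= x / (double)i;
        sum += term;
    }
    return sum;
}

/* Poisson probability of exactly k events when the mean is lambda. */
double poisson_pmf(unsigned k, double lambda)
{
    double p = 1.0 / exp_series(lambda);   /* e^-lambda, the k = 0 term */
    for (unsigned i = 1; i <= k; i++)
        p *= lambda / (double)i;           /* builds lambda^k / k! */
    return p;
}

/* Expected number of work units, out of `total`, with exactly k nonces. */
double expected_units(unsigned k, double lambda, double total)
{
    return total * poisson_pmf(k, lambda);
}
```

With a mean of 1 and 1000 work units this gives roughly 368/368/184/61/15 for 0 to 4 nonces, against the observed 374/364/175/64/19.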
member
Activity: 86
Merit: 10
Before reducing tick count:

Code:
tg@tg-DP700A3D-DM700A3D-DB701A3D-DP700A7D:~/Desktop/Github/klondike/utils$ ./ktest 
Klondike device opened

Version:10, ProductID:K16, Serial#:deadbeef
Cmds [WAISCE.Q]:

ww

State:W, ASICs:16, Slaves:0
WorkQ:0, WorkID:01, Temp:98, Fan:0, ErrCount:0, HashCount:0, MaxCount:2048
Cmds [WAISCE.Q]:

State:W, ASICs:16, Slaves:0
WorkQ:1, WorkID:01, Temp:98, Fan:0, ErrCount:0, HashCount:1, MaxCount:2048
Cmds [WAISCE.Q]:

Nonce Found - WorkID:01, Value:749fcd72 (0.305 secs) , Nonce:b2cc9f74 GOOD
Cmds [WAISCE.Q]:

Nonce Found - WorkID:01, Value:749fcd72 (1.048 secs) , Nonce:b2cc9f74 GOOD
Cmds [WAISCE.Q]:

Nonce Found - WorkID:02, Value:749fcd72 (0.983 secs) , Nonce:b2cc9f74 GOOD
Cmds [WAISCE.Q]:

Nonce Found - WorkID:02, Value:749fcd72 (1.048 secs) , Nonce:b2cc9f74 GOOD
Cmds [WAISCE.Q]:

After reducing tick count:

Code:
tg@tg-DP700A3D-DM700A3D-DB701A3D-DP700A7D:~/Desktop/Github/klondike/utils$ ./ktest 
Klondike device opened

Version:10, ProductID:K16, Serial#:deadbeef
Cmds [WAISCE.Q]:
ww

State:W, ASICs:16, Slaves:0
WorkQ:0, WorkID:01, Temp:118, Fan:0, ErrCount:0, HashCount:0, MaxCount:1024
Cmds [WAISCE.Q]:

State:W, ASICs:16, Slaves:0
WorkQ:1, WorkID:01, Temp:118, Fan:0, ErrCount:0, HashCount:1, MaxCount:1024
Cmds [WAISCE.Q]:

Nonce Found - WorkID:01, Value:749fcd72 (0.305 secs) , Nonce:b2cc9f74 GOOD
Cmds [WAISCE.Q]:

Nonce Found - WorkID:02, Value:749fcd72 (1.016 secs) , Nonce:b2cc9f74 GOOD
Cmds [WAISCE.Q]:
member
Activity: 86
Merit: 10
After uncommenting line 49 in asic.c, I have noticed the following:

1. In ktest, the nonces are still coming back twice.
2. In cgminer, the hash rate has gone up to 2.4 GH/s (5s) and 790 MH/s (avg). The avg never went above 200 before.

I have also noticed that a lot of times there is no activity in the log except for the regular USB scanning. This period of lull always follows "Pushing work from pool 0 to hash queue". I am assuming this is because sometimes the nonce never comes back, and you probably have a certain wait period in cgminer, after which it re-sends the work data to the board.
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
I tried this. I am still getting the nonce back twice.
Probably it's running twice as long as it should, so the tick count needs to be half.
ie.

Status.MaxCount = WORK_TICKS / BankSize;

needs to be,

Status.MaxCount = WORK_TICKS / BankSize / 2;

(dividing by ChipCount will work unless an odd number of chips is mounted)

The time interval between two nonces should be 2^32/clk/16.
eg. at 128 MHz
2^32 / 128000000 / 16 = 2.10 seconds

at 300 MHz, 0.895 seconds.
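For checking other clock settings, the interval formula above fits in one line of C. The function name is just for illustration, and it assumes the nonce space is split evenly across the chips.

```c
/* Expected seconds for one full pass over the nonce space:
   2^32 / clk / chips, with clk in Hz. */
double range_seconds(double clk_hz, unsigned chips)
{
    return 4294967296.0 / clk_hz / (double)chips;
}
```

range_seconds(128e6, 16) comes out at about 2.10 and range_seconds(300e6, 16) at about 0.895. An interval close to double the expected value would suggest the work is running for twice the intended tick count.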

In ktest if you send "ww" it will do 2 works and check the time between them.
It now sends multi-char cmds as a sequence of cmds.
newbie
Activity: 32
Merit: 0
@ BKK and TH: nice to see you talking at this high technical level.   Grin

@All: are there any options for watercooling?
User toolhead created a very nice solution for Burnin's board, and Burnin will actually (optionally) assemble his boards directly with toolhead's watercooling block.

So it would be very nice if I could use my pump and radiator for all my boards, including my TH's K64.

member
Activity: 86
Merit: 10
edit: Also, did you notice I pushed another update a bit later today which worked better for error rates.

Yes I pulled the latest updates. I think the error rate has come down.

Oh yes, I forgot to mention something else. You want BankRanges not doubled but have to uncomment line 49 in the asic.c write code. As below,

    // disable for single bank last_bit0 = last_bit1 = split;

should be,

    last_bit0 = last_bit1 = split;

This causes it to write the high bit 0 for bank 1, and high bit 1 for bank 2, effectively splitting the ranges over both banks.

Give that a whirl. I haven't mounted chips in the second bank yet and so this will be first testing of that.

I tried this. I am still getting the nonce back twice.
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
As you know we have all the 16 chips populated. This is what I changed in klondike.c

Code:
    Status.ChipCount = 16; // just for testing
    
    // pre-calc nonce range values
    BankSize = Status.ChipCount/2; //(Status.ChipCount+1)/2;
    Status.MaxCount = WORK_TICKS / BankSize;
    NonceRanges[0] = 0;
    for(BYTE x = 1; x < BankSize; x++)
        NonceRanges[x] = NonceRanges[x-1] + BankRanges[BankSize-1];  // single bank, double range size

Now all the nonces are coming back twice. If we change back the last line to
Code:
NonceRanges[x] = NonceRanges[x-1] + 2*BankRanges[BankSize-1]; 
the nonces come back only once, but take about double the time. What's the correct setting for 16 chips?
Oh yes, I forgot to mention something else. You want BankRanges not doubled but have to uncomment line 49 in the asic.c write code. As below,

    // disable for single bank last_bit0 = last_bit1 = split;

should be,

    last_bit0 = last_bit1 = split;

This causes it to write the high bit 0 for bank 1, and high bit 1 for bank 2, effectively splitting the ranges over both banks.
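To picture the split: a rough C sketch of how the start of each chip's range could be laid out when the high bit selects the bank. This is not the firmware's code. The helper name is hypothetical, banks are numbered 0 and 1 here rather than 1 and 2, and it assumes each bank covers exactly half of the 32-bit nonce space, divided evenly among its chips.

```c
#include <stdint.h>

/* Hypothetical helper: starting nonce for chip `idx` (0-based) in bank
   `bank` (0 or 1), when each bank of `bank_size` chips covers one half
   of the 32-bit nonce space. */
uint32_t range_start(unsigned bank, unsigned idx, unsigned bank_size)
{
    uint32_t per_chip = (UINT32_C(1) << 31) / bank_size; /* half the space, per chip */
    uint32_t high_bit = (uint32_t)(bank & 1u) << 31;     /* 0 for first bank, 1 for second */
    return high_bit | (per_chip * idx);
}
```

With 8 chips per bank the first bank starts at 0x00000000, 0x10000000, ... and the second at 0x80000000, 0x90000000, ..., so no two chips search the same nonces.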

Give that a whirl. I haven't mounted chips in the second bank yet and so this will be first testing of that.

edit: Also, did you notice I pushed another update a bit later today which worked better for error rates.
member
Activity: 86
Merit: 10
As you know we have all the 16 chips populated. This is what I changed in klondike.c

Code:
    Status.ChipCount = 16; // just for testing
   
    // pre-calc nonce range values
    BankSize = Status.ChipCount/2; //(Status.ChipCount+1)/2;
    Status.MaxCount = WORK_TICKS / BankSize;
    NonceRanges[0] = 0;
    for(BYTE x = 1; x < BankSize; x++)
        NonceRanges[x] = NonceRanges[x-1] + BankRanges[BankSize-1];  // single bank, double range size

Now all the nonces are coming back twice. If we change back the last line to
Code:
NonceRanges[x] = NonceRanges[x-1] + 2*BankRanges[BankSize-1]; 
the nonces come back only once, but take about double the time. What's the correct setting for 16 chips?
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
K1

We run it for short periods 2-3 min, because there is only a heatsink
Good to see a K1 running. What sort of HW error rate do you get?

I'm exploring the power supply here.
I have a 100kHz 0.5V amplitude signal in 5ms bursts every 10ms on my 1.2V core power. It's showing on 12V in from PSU, and on 3.3V. Does anyone have experience with what can cause that kind of pulsation? When looking in closely at the 100kHz it's quite substantial spikes with damped oscillations of approx. 100 MHz, 400mV. That's pretty big on a 1.2V supply.

There's also 600kHz pulses from the switching supply but they're much smaller, about 10mV.

These are present whether I enable or disable the hash clock so I don't think it's the hashing doing it. I've tried turning off everything nearby including fluorescent lights, TV, laptop, raspi etc.



Close up of 100kHz spike made up of 100MHz oscillation.

hero member
Activity: 728
Merit: 500
K1

Hashing @ 300 MHz


Hashing @ 350 MHz


We run it for short periods 2-3 min, because there is only a heatsink
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
I pushed my current changes up to Github so others can use them.

Note in klondike.c for varying chip counts you need to hard code the actual chip count and whether there is one bank or two. With one bank the range size gets doubled; with two banks it shouldn't be. This isn't coded properly yet; later it will be detected during init.

This new code has much better timing values for TMR0 and removes all interrupt services except for result rx. It uses polling instead for USB and TMR0. I'm not sure if this is needed but I'm trying to give the result capture as fast response as possible.

It's working better at 350 than at 300 now. Does that indicate stability issues with interference or resonance?

****
Also, the clock value is now the same as the MHz rate, not double like before. So set 300 for 300 etc. The default changed to 256.

Note the Rx edge may be different if a second NOR gate is not used. Rising edge for me but with one gate falling edge, unless you add an inverter to both data and clk (good idea).
full member
Activity: 140
Merit: 100
Right now at 350MHz it's 1 for 270, showing that it can do it. But when will it break down and start averaging out? I don't know if I should spend more time on this or assume that ferrite beads on the PLL supply will help a lot, and make new boards. I don't have any beads here to try. I have them on the way along with parts for building 13 more boards. I'll add it to the current changes.

fwiw - If those beads were your first instinct, then they're most likely the solution. 
I know waiting for resources has a way of driving me down some winding sideroads …

just my 2 bits - thnx for sharing the deets with us!
hero member
Activity: 784
Merit: 1009
firstbits:1MinerQ
Well, at first I thought it was going to work. I got around 100 good nonces before a bad one. But it didn't hold up. It's not worse but only a little better. Although at higher clocks like 350 it works much better than before. Seems the NOR gates condition well enough at 350 but at this speed the IRQ response becomes an issue, and IOC gets me the extra µs needed.

My idea: change the ISR to trigger on IOC for the CLK rather than the UART byte ready (RCIF). And then use a timeout to filter random single bit triggers. This gives me 7 bits extra time to read that first byte, and if noise triggers < 8 bits between results the timeout resets it. And first trials with ktest were super positive. I went all the way up to 390 with manual data and zero errors over a few dozen work units - something never seen before by me.

Alas, in cgminer it did well at first but soon over time averaged out to not much better than before. I tried playing with various timeout counts, and reset methods.

Right now at 350MHz it's 1 for 270, showing that it can do it. But when will it break down and start averaging out? I don't know if I should spend more time on this or assume that ferrite beads on the PLL supply will help a lot, and make new boards. I don't have any beads here to try. I have them on the way along with parts for building 13 more boards. I'll add it to the current changes.

I also added a "noise" count so that every time it detects a <8 bit trigger it counts it. With that I either see many if the timeout is low, or none if the timeout is longer. So that just adds to the confusion and suggests there are no real noise bits.
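For anyone following along, here is a rough host-side C sketch of the filter idea described above: count clock edges, and if the line goes quiet before a full byte's worth has arrived, count the burst as noise and reset. It is not the firmware's ISR. All names, the tick-based timeout and the struct layout are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define BITS_PER_BYTE 8u

/* Hypothetical filter state, not taken from the Klondike firmware. */
struct rx_filter {
    uint16_t edges;        /* clock edges seen in the current burst */
    uint16_t idle_ticks;   /* timer ticks since the last edge */
    uint16_t timeout;      /* ticks of silence that end a burst */
    uint32_t noise_count;  /* bursts that ended with fewer than 8 edges */
};

void rx_filter_init(struct rx_filter *f, uint16_t timeout)
{
    f->edges = 0;
    f->idle_ticks = 0;
    f->timeout = timeout;
    f->noise_count = 0;
}

/* Call on every clock-edge interrupt. Returns true on the edge that
   completes the first full byte of a burst. */
bool rx_filter_on_edge(struct rx_filter *f)
{
    f->idle_ticks = 0;
    f->edges++;
    return f->edges == BITS_PER_BYTE;
}

/* Call on every timer tick. Once the line has been quiet for `timeout`
   ticks, a burst shorter than one byte is counted as noise and the
   filter resets for the next result. */
void rx_filter_on_tick(struct rx_filter *f)
{
    if (f->edges == 0)
        return;                       /* nothing pending */
    if (++f->idle_ticks < f->timeout)
        return;                       /* burst may still be in progress */
    if (f->edges < BITS_PER_BYTE)
        f->noise_count++;
    f->edges = 0;
    f->idle_ticks = 0;
}
```

In this sketch the timeout value decides what gets counted: set it shorter than the gap between real clock edges and partial results get cut off and counted as noise, which is one possible reading of seeing many counts with a low timeout and none with a longer one.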

Note to others with boards, e.g. TH: if you can get ferrite beads (as in the updated BOM) then you may want to test with them cut into the AVDD PLL power lines on each chip. I can't say if they'll help but they could make all the difference here, and it would be nice to know for a final board.

update: 350 MHz, 7 for 522... 1.3%
again: 14 for 699... 2%
sr. member
Activity: 322
Merit: 250
Supersonic
hero member
Activity: 924
Merit: 1000
Writing this just gave me an idea.

objectification == best consultant

"Teaching often leads to the teacher learning more than the student." - Bicknellski
hero member
Activity: 826
Merit: 1001
I don't want to sound mean but ...
Writing this just gave me an idea.
No worries, I have a thick skin and I am glad I could be of indirect help Wink
Works for me too: I explain to someone how well I solved something and then realize ...
full member
Activity: 140
Merit: 100
Writing this just gave me an idea.

objectification == best consultant