
Topic: KnC die #0 disabled (Read 4121 times)

legendary
Activity: 1848
Merit: 1001
November 11, 2013, 02:41:02 AM
#22
BUMP!!


11 days later!!! BUMP ffs!
hero member
Activity: 561
Merit: 521
Trustless IceColdWallet
November 08, 2013, 01:38:48 PM
#21
Here is the evidence:




Hair dryer in action







Come on KNC ORSOC Guys!!!!!
hero member
Activity: 561
Merit: 521
Trustless IceColdWallet
November 08, 2013, 07:32:23 AM
#20
Hi Bitcoinorama,

I hope you have a good connection to KnC. Can you forward my post to them?

They can send me about 0.2 BTC for my work:   12feS3BnvYkYAf3wsrrftmeNrw1B5HRSZ1



I figured out what it is.

This procedure is repeatable; everyone can check it!

It is a temperature problem !!!!!!

Die 0 and die 1 are on the left side of the board, and die 2 and die 3 are on the right side.

CASE 1

1. Update from 0.98 to 0.98.1.
2. If only die 0 is working at this point, everything is normal so far.
    If you wait a while, die 0 stops and the other ones start = this is normal! Why?

3. The monitordcdc script sends commands to all 4 dies. At that moment the temperature on them goes up (they are doing something):
    volts, power, amps, working bits = heat

4. Now die 3 or die 4 notices that and switches on! Die 4 says to die 3 "hey, come up", die 3 says to die 2 "hey, come up", but die 2
   has dementia and forgets to tell die 0 "hey, come up". Or, the last case, die 0 says "no, I am the master, only the controller can
   send commands to me". Hmm, but when I looked through the PMBus documentation I couldn't find anything like that.

   Why? See the picture; a picture says more than a million words:


Communication runs from 0 to 4, not from 4 to 0, 3 to 2, or 2 to 1

ONLY from 1 or 2 or 3 to the next one



CASE 2

1. Same as above
2. Same as above

NOW NOW NOW NOW NOW

3. I take a hairdryer and slowly blow warm air onto die 0 and die 1, on the left side of the ASIC board!
4. The temperature slowly climbs. Near 69°C I stopped the hairdryer, and a few seconds
    later only die 1 started working! NOW NOW WOW WOW WOW

If you look around the forum you will notice that many people say: HIGHER temp = better performance!

That is exactly the problem. In the PMBus documentation you will find a command to switch a die on or off.


Solution / Problem here:
1. ########################################
10.8.2. Sending Too Few Bits
PMBus (and SMBus) transactions are carried out one byte at time. If while a device is writing to a PMBus device the transmission is interrupted by a START or STOP condition before a complete byte has been sent, this is a data transmission fault.
When a PMBus device detects this fault, it shall respond as follows:
• Flush or ignore the received command code and any received data,
• Set the CML bit in the STATUS_BYTE,
• Set bit [1] (“Other” fault) bit in the STATUS_CML register (if supported), and
• Notify the host as described in Section 10.2.2.

(PMBus Power System Mgt Protocol Specification – Part II – Revision 1.1, © 2007 System Management Interface Forum, Inc., All Rights Reserved)

READ on from here

2. ########################################
10.8.7. Device Busy

ME: Before sending commands we have to stop the device, and only then send the command

3.
11.2. STORE_DEFAULT_ALL
The STORE_DEFAULT_ALL command instructs the PMBus device to copy the entire contents of the Operating Memory to the matching locations in the non-volatile Default Store memory. Any items in Operating Memory that do not have matching locations in the Default Store are ignored.
It is permitted to use the STORE_DEFAULT_ALL command while the device is operating. However, the device may be unresponsive during the copy operation with unpredictable, undesirable or even catastrophic results. PMBus device users are urged to contact the PMBus device manufacturer about the consequences of using the STORE_DEFAULT command while the device is operating and providing output power.
This command has no data bytes. This command is write only.

WRITE_PROTECT data byte values (table from the same page of the spec):

Data Byte Value   Meaning
1000 0000         Disable all writes except to the WRITE_PROTECT command
0100 0000         Disable all writes except to the WRITE_PROTECT, OPERATION and PAGE commands
0010 0000         Disable all writes except to the WRITE_PROTECT, OPERATION, PAGE, ON_OFF_CONFIG and VOUT_COMMAND commands
0000 0000         Enable writes to all commands.

4.
11.3. RESTORE_DEFAULT_ALL

ME: I think we can figure that out!

The RESTORE_DEFAULT_ALL command instructs the PMBus device to copy the entire contents of the non-volatile Default Store memory to the matching locations in the Operating Memory. The values in the Operating Memory are overwritten by the value retrieved from the Default Store. Any items in Default Store that do not have matching locations in the Operating Memory are ignored.
It is permitted to use the RESTORE_DEFAULT_ALL command while the device is operating. However, the device may be unresponsive during the copy operation with unpredictable, undesirable or even catastrophic results. PMBus device users are urged to contact the PMBus device manufacturer about the consequences of using the RESTORE_DEFAULT_ALL command while the device is operating and providing output power.

This command has no data bytes. This command is write only.

5. All possible commands:

Starting from page 73



6. Potential Conflict





More info here:

http://pmbus.org/docs/PMBus_Revision_1-2_Presentation_20100228.pdf
http://pmbus.org/docs/PMBus_Specification_Part_I_Rev_1-1_20070205.pdf
http://pmbus.org/docs/PMBus_Specification_Part_II_Rev_1-1_20070205.pdf


I am an absolute beginner and have absolutely no know-how in PMBus programming, but one thing I do know is
that the CONTROLLING of the temps inside the die is the problem!



sr. member
Activity: 467
Merit: 250
November 08, 2013, 04:35:17 AM
#19
a = bus
b = channel

PMBus has a redundant bus system, which here means that sending data to bus 2 is enough.

Bus 1 is different: it has valid data only on channel 24.


So was that what you needed?
hero member
Activity: 561
Merit: 521
Trustless IceColdWallet
November 08, 2013, 01:05:21 AM
#18
No, that's not right.

I need the dumps listed above as quickly as possible, to verify something.

Fubly:

I ran this on a jupiter presently at 16/16. Wrote a quick script to grab what you wanted:

Code:
# Dump addresses 0x20-0x25 on each of i2c buses 1-8
for BUS in 1 2 3 4 5 6 7 8; do
        for CHANNEL in 20 21 22 23 24 25; do
                echo "dumping ( $BUS : $CHANNEL)"
                i2cdump -y "$BUS" "0x$CHANNEL"
                echo
        done
done



a = bus
b = channel

PMBus has a redundant bus system, which here means that sending data to bus 2 is enough.

Bus 1 is different: it has valid data only on channel 24.
sr. member
Activity: 467
Merit: 250
November 07, 2013, 09:32:58 PM
#17
I hacked up the monitordcdc script, tossing in a 'logger' line to write something to syslog each time it tries to restart a die.. (you have to start /etc/init.d/syslog.busybox as well, and then tail -f /var/log/messages)


yikes.. I'm curious if other people are seeing monitorDCDC having to light cores back up as often as I am..


Quote
Nov  7 23:28:38 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  7 23:28:39 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:30:25 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:32:12 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  7 23:32:12 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 0
Nov  7 23:32:12 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 1
Nov  7 23:32:12 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:32:12 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 3
Nov  7 23:33:59 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:35:46 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  7 23:35:46 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:37:33 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:39:19 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  7 23:39:20 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:41:07 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:42:54 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:44:41 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  7 23:44:41 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:46:28 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:48:15 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  7 23:48:15 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:50:02 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:51:48 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  7 23:51:49 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:53:36 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:55:22 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  7 23:55:23 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:57:10 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:58:57 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  7 23:58:57 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:00:43 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:02:31 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:02:31 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:04:18 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:06:05 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:06:05 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:07:52 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:09:38 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:09:38 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:11:26 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:13:12 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:13:12 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:15:00 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:16:46 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:16:47 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:18:33 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:20:20 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:20:21 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:22:07 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:23:54 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:23:55 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:25:41 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:27:29 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:27:29 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:29:16 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:31:02 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:31:03 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:32:50 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 0
Nov  8 00:32:50 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 1
Nov  8 00:32:50 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:32:50 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 3
Nov  8 00:34:36 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:34:36 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:36:24 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:38:10 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:39:57 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:39:57 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:41:44 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:43:31 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:43:31 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:45:19 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:47:05 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:47:05 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:48:52 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:50:39 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:50:39 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:52:26 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:54:13 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:54:13 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:56:00 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:57:47 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 00:57:48 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 00:59:34 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:01:21 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:01:21 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:03:09 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:04:55 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:04:55 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:06:43 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:08:29 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:08:30 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:10:16 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:12:03 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:12:03 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:13:50 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:15:37 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:15:38 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:17:24 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:19:10 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:19:11 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:20:58 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:22:44 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:22:45 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:24:32 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:26:18 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:26:18 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:28:05 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:29:52 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:29:52 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:31:39 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:33:26 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:33:26 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:35:13 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:37:00 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:37:00 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:38:46 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:40:33 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:40:33 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:42:20 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:44:07 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:44:07 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:45:54 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:47:41 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:47:41 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:49:28 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:51:15 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:51:15 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:53:01 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:54:49 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:54:49 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:56:35 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 01:58:21 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 01:58:22 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 02:00:09 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 02:01:56 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 02:01:56 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 02:03:43 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 02:05:30 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 02:05:30 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 02:07:18 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 02:09:04 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 02:09:05 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 02:10:51 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 02:12:38 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 02:12:38 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 02:14:25 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 02:16:12 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 02:16:12 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 02:17:59 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  8 02:19:45 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  8 02:19:45 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
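
To see at a glance which dies keep getting restarted, the logger lines can be tallied. This is only a minimal sketch: it assumes the syslog lines look exactly like the ones quoted above (i2c address third from the end, die number last), and `count_restarts` is an illustrative name, not something in the firmware.

```shell
# Tally restart attempts per (i2c address, die) from syslog lines on stdin.
# Assumes the "dcdc: i2cset -y 2 0x2N 0xe5 D" format shown in the quote above.
count_restarts() {
    awk '/dcdc: i2cset/ { n[$(NF-2) " die " $NF]++ }
         END { for (k in n) print n[k], k }' | sort -k2,2 -k4,4n
}

# Usage on the miner (log path as mentioned in the post):
#   count_restarts < /var/log/messages
```

Each output line has the form "count address die N", so a die that never comes back shows up as the one with by far the highest count.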

sr. member
Activity: 467
Merit: 250
November 07, 2013, 09:28:19 PM
#16
No, that's not right.

I need the dumps listed above as quickly as possible, to verify something.

Fubly:

I ran this on a jupiter presently at 16/16. Wrote a quick script to grab what you wanted:

Code:
# a = i2c bus, b = address suffix (0x20-0x25)
for a in 1 2 3 4 5 6 7 8; do
        for b in 20 21 22 23 24 25; do
                echo "dumping ( $a : $b )"
                i2cdump -y "$a" "0x$b"
                echo
        done
done


Here's the output : http://pastebin.com/ZhzGZBuf

hero member
Activity: 561
Merit: 521
Trustless IceColdWallet
November 07, 2013, 08:04:14 PM
#15


0x20 = module 0
0x22 = module 1
0x24 = module 2
0x27 = module 3 (I think)

0, 1, 2, 3 = dies

So that equates to 3rd module, 3rd core, then first module, 4th core, then repeatedly 3rd module, 3rd core.

In my case, 0/4 can be restarted, and it does every few minutes when it stops, but 3/3 will never restart.. but the script tries repeatedly.



0 to 6 = the channel the ASIC board is attached to

0x20 = Asic board on port / channel 1
0x21 = Asic board on port / channel 2
0x22 = Asic board on port / channel 3
0x23 = Asic board on port / channel 4
0x24 = Asic board on port / channel 5
0x25 = Asic board on port / channel 6
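
For reading the monitordcdc log lines, the table above can be turned into a small lookup. This is a sketch that assumes the mapping in this post is the right one (0x20 = port 1 up to 0x25 = port 6); `describe_restart` is a made-up helper name.

```shell
# Translate the address and die number of a restart command into plain words,
# using the mapping from the post above: 0x20 = port 1 ... 0x25 = port 6.
describe_restart() {
    addr=$1 die=$2
    case "$addr" in
        0x2[0-5]) port=$(( ${addr#0x2} + 1 )) ;;
        *) echo "unknown address $addr" >&2; return 1 ;;
    esac
    echo "ASIC board on port $port, die $die"
}

# Example: the log line "i2cset -y 2 0x24 0xe5 2" would be
#   describe_restart 0x24 2
```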

hero member
Activity: 561
Merit: 521
Trustless IceColdWallet
November 07, 2013, 07:33:52 PM
#14
No, that's not right.

The PMBus protocol has a redundant bus system,

master - slave.

See here: http://pmbus.org/docs/Using_The_PMBus_20051012.pdf

Have a look at page 142.


I need the dumps listed above as quickly as possible, to verify something.
sr. member
Activity: 467
Merit: 250
November 07, 2013, 06:22:59 PM
#13
Monitordcdc has more changes: the interval for checking VRMs that output zero current was decreased from 15 minutes to 20 seconds (15 checks 1 minute apart vs 5 checks 4 seconds apart). When a VRM has more than 3 failures (= zero current output) in this 20-second interval, the die powered by this VRM is restarted (this was not present in 0.98). I am not sure why die 0 is restarted only when other dies have failed too (maybe die 0 is somehow connected to other dies?).

I hacked up the monitordcdc script, tossing in a 'logger' line to write something to syslog each time it tries to restart a die.. (you have to start /etc/init.d/syslog.busybox as well, and then tail -f /var/log/messages)

like this:
Code:
                if [ "$failed1" = "1" ] ; then
                        i2cset -y 2 0x2$channel 0xe5 1
                        # added: record every restart attempt in syslog
                        logger -t "dcdc" "i2cset -y 2 0x2$channel 0xe5 1"
                fi
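
Rather than pasting a logger line under every i2cset call, the two could be wrapped in one function. This is a rough sketch and not what the stock script does: `restart_die`, `LOGGER` and `I2CSET` are hypothetical names, and the two variables exist only so the commands can be swapped out for a dry run.

```shell
# Hypothetical helper: log a restart attempt to syslog, then issue it.
# LOGGER and I2CSET default to the real tools but can be overridden for testing.
restart_die() {
    channel=$1 die=$2
    cmd="i2cset -y 2 0x2$channel 0xe5 $die"
    ${LOGGER:-logger} -t dcdc "$cmd"
    ${I2CSET:-i2cset} -y 2 "0x2$channel" 0xe5 "$die"
}

# In the script this would replace each pair of lines, e.g.
#   if [ "$failed1" = "1" ] ; then restart_die "$channel" 1 ; fi
```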



What I see looks like it's trying to restart individual dies, no matter which one it is.

Quote
Nov  7 23:10:49 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:12:35 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  7 23:12:35 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:14:23 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:16:09 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:17:57 knc2 user.notice dcdc: i2cset -y 2 0x20 0xe5 3
Nov  7 23:17:57 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2
Nov  7 23:19:43 knc2 user.notice dcdc: i2cset -y 2 0x24 0xe5 2

0x20 = module 0
0x22 = module 1
0x24 = module 2
0x27 = module 3 (I think)

0, 1, 2, 3 = dies

So that equates to 3rd module, 3rd core, then first module, 4th core, then repeatedly 3rd module, 3rd core.

In my case, 0/4 can be restarted, and it does every few minutes when it stops, but 3/3 will never restart.. but the script tries repeatedly.

legendary
Activity: 1330
Merit: 1026
Mining since 2010 & Hosting since 2012
November 07, 2013, 02:21:56 PM
#12
FYI:  I applied 0.98-1 to a client's Jupiter unit that was at 320 GH and it fixed the issue and now it is well over 500 stable for almost a week. 
hero member
Activity: 561
Merit: 521
Trustless IceColdWallet
November 07, 2013, 02:20:24 PM
#11
Can anyone with a full-speed machine send me the results of this:

i2cdump -y 1 0x24

i2cdump -y 2 0x20

i2cdump -y 2 0x21

i2cdump -y 2 0x22

i2cdump -y 2 0x23

i2cdump -y 2 0x24

i2cdump -y 2 0x25

i2cdump -y 3 0x20

i2cdump -y 3 0x21

i2cdump -y 3 0x22

i2cdump -y 3 0x23

i2cdump -y 3 0x24

i2cdump -y 3 0x25

i2cdump -y 4 0x20

i2cdump -y 4 0x21

i2cdump -y 4 0x22

i2cdump -y 4 0x23

i2cdump -y 4 0x24

i2cdump -y 4 0x25

i2cdump -y 5 0x20

i2cdump -y 5 0x21

i2cdump -y 5 0x22

i2cdump -y 5 0x23

i2cdump -y 5 0x24

i2cdump -y 5 0x25

i2cdump -y 6 0x20

i2cdump -y 6 0x21

i2cdump -y 6 0x22

i2cdump -y 6 0x23

i2cdump -y 6 0x24

i2cdump -y 6 0x25

i2cdump -y 7 0x20

i2cdump -y 7 0x21

i2cdump -y 7 0x22

i2cdump -y 7 0x23

i2cdump -y 7 0x24

i2cdump -y 7 0x25

i2cdump -y 8 0x20

i2cdump -y 8 0x21

i2cdump -y 8 0x22

i2cdump -y 8 0x23

i2cdump -y 8 0x24

i2cdump -y 8 0x25
sr. member
Activity: 462
Merit: 250
November 02, 2013, 05:19:07 PM
#10
I am not sure why die 0 is restarted only when other dies have failed too (maybe die 0 is somehow connected to other dies?).

Code:
# restart die
# (die 0 is only restarted when some other die has failed as well)
if [ "$failed0" = "1" ] ; then
        if [ "$failed_non0" = "1" ] ; then
                i2cset -y 2 0x2$channel 0xe5 0
        fi
fi
if [ "$failed1" = "1" ] ; then
        i2cset -y 2 0x2$channel 0xe5 1
fi
if [ "$failed2" = "1" ] ; then
        i2cset -y 2 0x2$channel 0xe5 2
fi
if [ "$failed3" = "1" ] ; then
        i2cset -y 2 0x2$channel 0xe5 3
fi



Seems that they had a lot of die 0 DOA, so I guess they are targeting that for quicker testing.. but yes, I have had a die 3 out on a board for weeks.

I am running 0.98.1 on it now but it hasn't brought it back yet. It will be interesting to see if KnC elaborates on their methods.
legendary
Activity: 2408
Merit: 1004
November 02, 2013, 01:15:30 PM
#9
If I have a good miner,
is it a good idea to install this firmware or not?
newbie
Activity: 11
Merit: 0
November 02, 2013, 11:57:05 AM
#8
Does anyone know what exactly this firmware does?

It seems to get mixed results.

I am only getting 460 on my Jupiter. I am tempted to try this firmware.

tl;dr: this patch lowers a voltage on the controller board from 1.95V to 1.45V and tries to restart failed dies in 20 sec intervals.

Long story:
Running diff on images
Code:
diff -rq kncminer-0.98/ramdisk kncminer-0.98.1\(beta\)/ramdisk 2> /dev/null
results in
Code:
Files kncminer-0.98/ramdisk/etc/init.d/initc.sh and kncminer-0.98.1(beta)/ramdisk/etc/init.d/initc.sh differ
Files kncminer-0.98/ramdisk/etc/rcS.d/S36initc.sh and kncminer-0.98.1(beta)/ramdisk/etc/rcS.d/S36initc.sh differ
Files kncminer-0.98/ramdisk/etc/shadow and kncminer-0.98.1(beta)/ramdisk/etc/shadow differ
Files kncminer-0.98/ramdisk/sbin/monitordcdc and kncminer-0.98.1(beta)/ramdisk/sbin/monitordcdc differ
Files kncminer-0.98/ramdisk/usr/sbin/lighttpd and kncminer-0.98.1(beta)/ramdisk/usr/sbin/lighttpd differ
Files kncminer-0.98/ramdisk/var/cache/ldconfig/aux-cache and kncminer-0.98.1(beta)/ramdisk/var/cache/ldconfig/aux-cache differ
Files kncminer-0.98/ramdisk/var/lib/opkg/info/initc-bin.control and kncminer-0.98.1(beta)/ramdisk/var/lib/opkg/info/initc-bin.control differ
Files kncminer-0.98/ramdisk/var/lib/opkg/info/initscripts.control and kncminer-0.98.1(beta)/ramdisk/var/lib/opkg/info/initscripts.control differ
Files kncminer-0.98/ramdisk/var/lib/opkg/info/lighttpd-module-access.control and kncminer-0.98.1(beta)/ramdisk/var/lib/opkg/info/lighttpd-module-access.control differ
Files kncminer-0.98/ramdisk/var/lib/opkg/info/lighttpd-module-accesslog.control and kncminer-0.98.1(beta)/ramdisk/var/lib/opkg/info/lighttpd-module-accesslog.control differ
Files kncminer-0.98/ramdisk/var/lib/opkg/info/lighttpd-module-auth.control and kncminer-0.98.1(beta)/ramdisk/var/lib/opkg/info/lighttpd-module-auth.control differ
Files kncminer-0.98/ramdisk/var/lib/opkg/info/lighttpd-module-cgi.control and kncminer-0.98.1(beta)/ramdisk/var/lib/opkg/info/lighttpd-module-cgi.control differ
Files kncminer-0.98/ramdisk/var/lib/opkg/info/lighttpd-module-dirlisting.control and kncminer-0.98.1(beta)/ramdisk/var/lib/opkg/info/lighttpd-module-dirlisting.control differ
Files kncminer-0.98/ramdisk/var/lib/opkg/info/lighttpd-module-expire.control and kncminer-0.98.1(beta)/ramdisk/var/lib/opkg/info/lighttpd-module-expire.control differ
Files kncminer-0.98/ramdisk/var/lib/opkg/info/lighttpd-module-indexfile.control and kncminer-0.98.1(beta)/ramdisk/var/lib/opkg/info/lighttpd-module-indexfile.control differ
Files kncminer-0.98/ramdisk/var/lib/opkg/info/lighttpd-module-staticfile.control and kncminer-0.98.1(beta)/ramdisk/var/lib/opkg/info/lighttpd-module-staticfile.control differ
Files kncminer-0.98/ramdisk/var/lib/opkg/info/lighttpd.control and kncminer-0.98.1(beta)/ramdisk/var/lib/opkg/info/lighttpd.control differ
Files kncminer-0.98/ramdisk/var/lib/opkg/status and kncminer-0.98.1(beta)/ramdisk/var/lib/opkg/status differ
Files kncminer-0.98/ramdisk/www/pages/firmware_upgrade.html and kncminer-0.98.1(beta)/ramdisk/www/pages/firmware_upgrade.html differ

We are actually interested in /etc/init.d/initc.sh and /sbin/monitordcdc as other files are just some new versions or changed timestamps.
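
To see what actually changed in those two files, a unified diff is enough. This is a small sketch that assumes both images were unpacked into the directories used in the diff -rq command above; `show_changes` is just an illustrative name.

```shell
# Show the real changes in the two files of interest between the unpacked images.
# Default directory names are the ones used in the diff -rq command above.
show_changes() {
    old=${1:-kncminer-0.98/ramdisk}
    new=${2:-"kncminer-0.98.1(beta)/ramdisk"}
    for f in etc/init.d/initc.sh sbin/monitordcdc; do
        diff -u "$old/$f" "$new/$f"
    done
}

# Usage from the directory holding both unpacked images:
#   show_changes | less
```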

In initc.sh a few lines were added at the end of the file. The first four i2cset lines set the DCDC1 voltage adjustment in the controller board's voltage controller to 1.450 V (the value for 0.98 on my Mercury is 1.950 V). The rest set the GO flag in the slew rate register in order to apply the voltage change. I am not sure what exactly is powered by this voltage.
Code:
v=56
i2cset -y 1 0x24 0xb 0x73
i2cset -y 1 0x24 0xe $v
i2cset -y 1 0x24 0xb 0x73
i2cset -y 1 0x24 0xe $v

i2cset -y 1 0x24 0xb 0x6c
i2cset -y 1 0x24 0x11 0x86
i2cset -y 1 0x24 0xb 0x6c
i2cset -y 1 0x24 0x11 0x86
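
One pattern in these lines: the byte written to register 0xb is always 0x7d XOR the register written right after it (0x7d ^ 0x0e = 0x73, 0x7d ^ 0x11 = 0x6c), and each pair is sent twice. That looks like a password-protected register scheme of the kind used on TI's TPS65217 PMIC, but the chip identity is an assumption and nothing in the firmware diff confirms it. Under that assumption, here is a sketch that only prints the commands and never touches the bus (`pmic_write_cmds` is a made-up name):

```shell
# Print (do not run) the i2cset sequence for one protected register write.
# Assumption: password = 0x7d XOR register address, whole pair sent twice,
# as observed in the initc.sh lines above.
pmic_write_cmds() {
    reg=$1 val=$2
    pw=$(printf '0x%x' $(( 0x7d ^ $reg )))
    for pass in 1 2; do
        echo "i2cset -y 1 0x24 0xb $pw"
        echo "i2cset -y 1 0x24 $reg $val"
    done
}

# pmic_write_cmds 0xe 56      reproduces the first block above
# pmic_write_cmds 0x11 0x86   reproduces the second block
```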


Monitordcdc has more changes: the interval for checking VRMs that output zero current was decreased from 15 minutes to 20 seconds (15 checks 1 minute apart vs 5 checks 4 seconds apart). When a VRM has more than 3 failures (= zero current output) in this 20-second interval, the die powered by this VRM is restarted (this was not present in 0.98). I am not sure why die 0 is restarted only when other dies have failed too (maybe die 0 is somehow connected to other dies?).

Code:
# restart die
# (die 0 is only restarted when some other die has failed as well)
if [ "$failed0" = "1" ] ; then
        if [ "$failed_non0" = "1" ] ; then
                i2cset -y 2 0x2$channel 0xe5 0
        fi
fi
if [ "$failed1" = "1" ] ; then
        i2cset -y 2 0x2$channel 0xe5 1
fi
if [ "$failed2" = "1" ] ; then
        i2cset -y 2 0x2$channel 0xe5 2
fi
if [ "$failed3" = "1" ] ; then
        i2cset -y 2 0x2$channel 0xe5 3
fi
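
The failure counting that feeds those failedN flags can be pictured roughly like this. It is a sketch of the behaviour described above, not the actual monitordcdc code: `read_current` is a hypothetical function that prints a die's VRM output current, and treating the string "0" as zero current is also an assumption.

```shell
# Sketch only: 5 checks, 4 seconds apart, restart when more than 3 read zero.
# read_current is hypothetical and must be supplied by the caller.
CHECKS=5      # checks per interval
PAUSE=4       # seconds between checks (5 x 4 = 20 seconds)
LIMIT=3       # the die counts as failed when failures exceed this

die_failed() {
    die=$1 failures=0 i=0
    while [ "$i" -lt "$CHECKS" ]; do
        [ "$(read_current "$die")" = "0" ] && failures=$((failures + 1))
        i=$((i + 1))
        sleep "$PAUSE"
    done
    [ "$failures" -gt "$LIMIT" ]
}

# Usage: if die_failed 1 ; then failed1=1 ; fi
```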


Dies 1-3 are also restarted at the beginning of the script.

Code:
# Give them a kick!
i2cset -y 2 0x2$channel 0xe5 1 >/dev/null 2>&1
i2cset -y 2 0x2$channel 0xe5 2 >/dev/null 2>&1
i2cset -y 2 0x2$channel 0xe5 3 >/dev/null 2>&1
legendary
Activity: 1512
Merit: 1000
@theshmadz
November 02, 2013, 08:41:19 AM
#7
Turns out crappy cooling appears to provide better hashrate, as many others have also reported...

Do the KNC units vary their fan speeds? Could be a symptom of the board not getting enough airflow > chips get hotter > fan speed increases > general airflow over board increases.

The fans on the heatsinks have a 4 pin connector, so they definitely have the ability to vary the speed. I have no idea if they actually do or not. It seems as though the CPU coolers are complete overkill though. I tried it out with removing all the fans, and just put one external fan blowing at the system with the case open and the temps were 40 degrees or less, but the hashrate in that format was around 450-490.

After more than an hour now, the rear 2 modules are staying steady at 70 degrees and the front ones are still at 60.  (+/- 2 degrees)

The hashrate reported by the device is now 558, and hashrate at the pool is reporting 549. I'm extremely happy right now  Grin

*edit* this is using the .98.1 firmware. I tested the exact same setup with the .98 firmware and I initially saw good results but within a few hours it was back down to 460 or so. it's only been a couple hours with the new firmware so the jury is still out on this one. Will have to wait and see if it can maintain these speeds, but it's looking solid so far.

Also, as a side note. If the optimal temperature is actually 70, then I would like to push more voltage and clockspeed to this thing until you reach the point where you are struggling to keep it under say 75 or so... restricting the airflow to intentionally increase the temperature of computer hardware is really twisting my stomach in knots.
hero member
Activity: 575
Merit: 500
November 02, 2013, 08:16:11 AM
#6
Turns out crappy cooling appears to provide better hashrate, as many others have also reported...

Do the KNC units vary their fan speeds? Could be a symptom of the board not getting enough airflow > chips get hotter > fan speed increases > general airflow over board increases.
hero member
Activity: 561
Merit: 521
Trustless IceColdWallet
November 02, 2013, 07:57:09 AM
#5
http://forum.kncminer.com/forum/main-category/main-forum/13767-firmware-beta-0-98-1-feedback-thread

Where did you get the source code to look under the hood?

With your mod it's possible to enable DC/DC 0. Is it a setting somewhere, like a config file, or what is the change?

Good Job Roll Eyes
sr. member
Activity: 288
Merit: 250
November 02, 2013, 02:40:52 AM
#4
It worked for my Jupiter the first time I tried it. It went from 414 avg on 0.98 to 545 avg and was still climbing, but I decided to run enablecores and after that reboot (and many others), it didn't work as well. Right now my Jupiter is going at 454 avg.
The 2 Saturns were very slowly climbing on 0.98.1 to about 200 avg before the reboot (211-214 avg on 0.98), but they are now staying at 106 and 140 after a few reboots.
So it seems that you need a lucky reboot (?) to have the miner work better?
legendary
Activity: 1512
Merit: 1000
@theshmadz
November 01, 2013, 08:07:15 PM
#3
Does anyone know what exactly this firmware does?

It seems to get mixed results.

I am only getting 460 on my Jupiter. I am tempted to try this firmware.

*edit* So, I tried the firmware, nothing changed. Tried .96.1  .96  .95, enable cores, bunch of different s**t. Then I put the case back on to restrict airflow and re-flashed the .98.1  - the temps of the back 2 are right around 70, the front 2 are around 60.

But the Jupiter is now reporting 534 and climbing. It will take an hour or so for the hashrate at the pool to normalize, but it is slowly rising as well...

Thus far, it appears these things really do like it hot. Originally I hated the design of the cooling for this device, because I thought it was pretty crappy at cooling... Turns out crappy cooling appears to provide better hashrate, as many others have also reported...
hero member
Activity: 532
Merit: 500
November 01, 2013, 06:06:35 PM
#2
You can try this fix.

Note it's totally unofficial, but it's what appears to be working until we have some solid feedback on the test rigs over the next couple of days. Results may ramp up slowly and take up to 3 hours to show solid performance, but it's brought several I have tested back to life:

www.kncminer.com/userfiles/file/kncminer-0.98.1(beta).bin
legendary
Activity: 966
Merit: 1000
November 01, 2013, 05:59:14 PM
#1
I thought this deserved its own thread.

I'm learning that a lot of the later shipments of KnC's miners have an issue where die #0 out of the 4 on each chip does not function.

This can be seen by installing BertMod and looking at the output current of the DC-DC converters.  Affected units will show a much lower current on #0 than on the other 3.

I'm afraid I can't contribute a lot since I have an early Saturn that is not among the affected units.

There is a much better summary of the issue over on the KnC forums:

http://forum.kncminer.com/forum/main-category/hardware/13049-read-me-first-known-issues-slow-performance-dc-dc-problem-bad-core-map-etc

I thought that a thread here would reach a wider audience, and may make for a more productive discussion.