
Topic: DiabloMiner GPU Miner - page 64. (Read 866596 times)

legendary
Activity: 1162
Merit: 1000
DiabloMiner author
May 20, 2011, 04:50:01 PM
Hey guys,

Need some help getting this working.

I ran bitcoin --daemon, then did ./DiabloMiner-OSX.sh --url http://myusername:[email protected]:8337/

Although I've set rpcuser and rpcpassword in bitcoin.conf, all I get is

Code:
[5/20/11 1:47:00 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:00 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:05 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:05 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:06 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:10 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:10 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:11 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden

Any ideas?

Thanks!

The default RPC port on Bitcoin is 8332 not 8337, and if you're connecting to a remote machine you need to add rpcallowip to the bitcoin.conf.
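A quick way to catch both problems before launching the miner is to parse the URL and check it. This is only a sketch: `RpcUrlCheck`, its methods, and the sample URLs are made up for illustration and are not part of DiabloMiner. The only facts taken from the reply are the default port 8332 and the need for rpcallowip on remote connections.

```java
import java.net.URI;

// Hypothetical helper, not part of DiabloMiner: sanity-checks a --url value.
final class RpcUrlCheck {
    // Default bitcoind RPC port, per the reply above.
    static final int DEFAULT_RPC_PORT = 8332;

    // Explicit port in the URL, or -1 when the URL does not name one.
    static int portOf(String url) {
        return URI.create(url).getPort();
    }

    // True when the host is not a loopback name, i.e. bitcoind would need
    // an rpcallowip entry. Simplified string check; no DNS lookup is done.
    static boolean isRemote(String url) {
        String host = URI.create(url).getHost();
        return !("127.0.0.1".equals(host) || "localhost".equals(host));
    }

    public static void main(String[] args) {
        // Placeholder credentials and hosts, not taken from the thread.
        String[] urls = {
            "http://user:pass@127.0.0.1:8337/",
            "http://user:pass@192.168.1.10:8332/"
        };
        for (String url : urls) {
            if (portOf(url) != DEFAULT_RPC_PORT) {
                System.out.println(url + " -> port " + portOf(url)
                        + ", expected " + DEFAULT_RPC_PORT);
            }
            if (isRemote(url)) {
                System.out.println(url + " -> remote host, add rpcallowip to bitcoin.conf");
            }
        }
    }
}
```

The loopback test is deliberately crude; a real check would have to handle other local addresses too.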
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
May 20, 2011, 04:47:59 PM
BTW, I am not going to accept a patch containing a custom concat setup. This is not C.
what do you mean by that?

I mean that this is Java and there is a limit to how ugly code should get before it's deemed unmaintainable. Plus, clever use of StringBuilder will do the same thing anyhow.
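One possible reading of "clever use of StringBuilder" is to keep a single builder around and reset it between messages instead of letting every `+` expression allocate a fresh one. The sketch below assumes that reading; `ReusableLine` and its `format` method are hypothetical names and do not come from DiabloMiner's source.

```java
// Hypothetical example, not DiabloMiner code: one StringBuilder reused per message.
final class ReusableLine {
    // Sized once up front; the backing array is kept between calls.
    private final StringBuilder sb = new StringBuilder(128);

    // Builds a status line such as "GPU0: 282 Mh/s".
    // Not thread-safe: each thread would need its own instance.
    String format(String device, long mhashPerSec) {
        sb.setLength(0); // clears the contents without releasing the buffer
        sb.append(device).append(": ").append(mhashPerSec).append(" Mh/s");
        return sb.toString();
    }

    public static void main(String[] args) {
        ReusableLine line = new ReusableLine();
        System.out.println(line.format("GPU0", 282));
        System.out.println(line.format("GPU1", 248));
    }
}
```

This stays within the standard library, which is the point being made: no custom concat class is needed to avoid the repeated builder allocations.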
member
Activity: 77
Merit: 10
May 20, 2011, 04:44:57 PM
BTW, I am not going to accept a patch containing a custom concat setup. This is not C.
what do you mean by that?
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
May 20, 2011, 04:42:52 PM
I got the profiler working... that was a lot easier than I thought it would be.  I haven't done too much Java outside of Google's DalvikVM, but it's not a true Java implementation so some things are done a little bit differently.

Anyway, I'm running the first batch of samples now :)

Are you going to be modifying the kernel much?  I'm curious as to how phatk reduced the operation count by that amount...

I did a lot of examining of phatk. I can't tell where he thinks he's saving cycles. Not only that, it runs exactly the same on SDK 2.1 and SDK 2.4 on my 5850 vs phoenix's standard kernel. Plus, if he is in fact exploiting anything, it probably isn't exploiting it as much as -v 3 -w 128 on mine on 69xx.

The key difference is not in the total number of instructions executed, but that they make better use of the 5-wide ALU design. Have a look at the ASM generated with AMD's KernelAnalyzer. Particularly the number of ALU ops. It's no faster than the poclbm kernel on 2.1, but for most people it eliminates the speed disadvantage of SDK 2.4.

It's also designed with VLIW5 in mind, so it's obviously not going to be optimal on VLIW4 hardware.

Well the big problem is on 2.4 phoenix-poclbm and phatk give near identical results... and both are still slower than real poclbm on both 2.1 and 2.4. And -v 18 and 19 give interesting results on 58xx on 2.4 which beats phatk's lackluster speed.

So... ymm so fucking v.
I've yet to replicate the same results on my system. In fact, with 2.4, phatk has beaten your miner every time. Every time I've tried anything other than -v 2 I get slower speeds.
This is on a Sapphire Extreme 5850 on Windows 7 x64.

I wonder if phatk only works well on specific Catalyst releases. No matter what I try, phatk scores the same as phoenix -k poclbm (which both are always slower than real poclbm and mine).

If I can't make phatk run as fast as it's supposed to, I can't replicate its techniques.
member
Activity: 69
Merit: 10
firstbits.com/1c3qpa
May 20, 2011, 03:48:02 PM
Hey guys,

Need some help getting this working.

I ran bitcoin --daemon, then did ./DiabloMiner-OSX.sh --url http://myusername:[email protected]:8337/

Although I've set rpcuser and rpcpassword in bitcoin.conf, all I get is

Code:
[5/20/11 1:47:00 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:00 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:05 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:05 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:06 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:10 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:10 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden
[5/20/11 1:47:11 PM] ERROR: Can't connect to Bitcoin: Bitcoin disconnected during response: 403 access forbidden

Any ideas?

Thanks!
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
May 20, 2011, 03:31:17 PM
If this is just totally unsupported, feel free to smack me. I'm running on a MacPro with both a 5870 and a 5770 in it, which seems perfectly okay doing normal OS things, including games.

If I try running DiabloMiner without any special flags, I get:

[5/20/11 1:17:33 PM] Added ATI Radeon HD 5870 (#1) (10 CU, local work size of 256)
[5/20/11 1:17:34 PM] Added ATI Radeon HD 5870 (#2) (20 CU, local work size of 256)

which doesn't seem right. I'm guessing the 5770 is #1.

With no special flags at all, I'm getting roughly 125M/sec. If I use -D 1 to make it only attach to the first card, it only drops to roughly 100M/sec which leads me to believe something very inefficient is going on.

I've tried various combos of -f, -v and -w and don't seem to be able to do anything but make it worse.

Is this configuration just not going to work at all? Is there any way I can force it to only use the 5870 instead?


I assume OSX is just naming your second card wrong. Mining in OSX isn't worth it, though... on Radeon 5xxxs you lose about 40% of your speed. Even so, it shouldn't be nearly as bad as what you're seeing.

I also suspect this cannot be fixed; OSX's OpenCL implementation is braindead.
member
Activity: 90
Merit: 12
May 20, 2011, 01:25:03 PM
If this is just totally unsupported, feel free to smack me. I'm running on a MacPro with both a 5870 and a 5770 in it, which seems perfectly okay doing normal OS things, including games.

If I try running DiabloMiner without any special flags, I get:

[5/20/11 1:17:33 PM] Added ATI Radeon HD 5870 (#1) (10 CU, local work size of 256)
[5/20/11 1:17:34 PM] Added ATI Radeon HD 5870 (#2) (20 CU, local work size of 256)

which doesn't seem right. I'm guessing the 5770 is #1.

With no special flags at all, I'm getting roughly 125M/sec. If I use -D 1 to make it only attach to the first card, it only drops to roughly 100M/sec which leads me to believe something very inefficient is going on.

I've tried various combos of -f, -v and -w and don't seem to be able to do anything but make it worse.

Is this configuration just not going to work at all? Is there any way I can force it to only use the 5870 instead?
full member
Activity: 235
Merit: 100
May 20, 2011, 01:22:51 PM
I got the profiler working... that was a lot easier than I thought it would be.  I haven't done too much Java outside of Google's DalvikVM, but it's not a true Java implementation so some things are done a little bit differently.

Anyway, I'm running the first batch of samples now :)

Are you going to be modifying the kernel much?  I'm curious as to how phatk reduced the operation count by that amount...

I did a lot of examining of phatk. I can't tell where he thinks he's saving cycles. Not only that, it runs exactly the same on SDK 2.1 and SDK 2.4 on my 5850 vs phoenix's standard kernel. Plus, if he is in fact exploiting anything, it probably isn't exploiting it as much as -v 3 -w 128 on mine on 69xx.

The key difference is not in the total number of instructions executed, but that they make better use of the 5-wide ALU design. Have a look at the ASM generated with AMD's KernelAnalyzer. Particularly the number of ALU ops. It's no faster than the poclbm kernel on 2.1, but for most people it eliminates the speed disadvantage of SDK 2.4.

It's also designed with VLIW5 in mind, so it's obviously not going to be optimal on VLIW4 hardware.

Well the big problem is on 2.4 phoenix-poclbm and phatk give near identical results... and both are still slower than real poclbm on both 2.1 and 2.4. And -v 18 and 19 give interesting results on 58xx on 2.4 which beats phatk's lackluster speed.

So... ymm so fucking v.
I've yet to replicate the same results on my system. In fact, with 2.4, phatk has beaten your miner every time. Every time I've tried anything other than -v 2 I get slower speeds.
This is on a Sapphire Extreme 5850 on Windows 7 x64.
legendary
Activity: 1540
Merit: 1049
Death to enemies!
May 20, 2011, 07:25:04 AM
I got the new version working! What I did:

1. I run the DiabloMiner-Windows.exe from command prompt with all arguments needed such as -u and -p

2. I need to manually specify the -v 2 argument to use vectors. Without vectors I have 248Mh/s; with -v 2 I finally got 282Mh/s instead of the former 260Mh/s. The BFI_INT is a huge improvement.

3. I created .BAT file myself to run DiabloMiner-Windows.exe with all necessary arguments.
Quote
The bat is probably running the old jar, which means, no, you're not running a new version of DiabloMiner.
No, I'm not so stupid. I have known how to use and edit bat files since MS-DOS 5.0 times. I check their contents before I run them.

And Thank You DiabloD3! If I ever find coins with Your miner, I will send you some of them!
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
May 20, 2011, 06:57:46 AM
BTW, I am not going to accept a patch containing a custom concat setup. This is not C.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
May 20, 2011, 06:41:33 AM
newbie
Activity: 14
Merit: 0
May 20, 2011, 06:34:19 AM
From my first run of profiling the miner, I saw that you were spending about 2% cpu time in just building strings (mainly StringBuilder copying char arrays internally).  Using the + operator is inlined to StringBuilder, which can be pretty slow.  I ran into this in my game engine here at work and had come across this post at StackOverflow from a guy that implements his own (albeit primitive) class for string concatenation.

I forgot to save the profile for that one (the profiler automatically overwrites the output file every time and I'm lazy :P), but I reduced the CPU time spent on String building from 2% to <= .01%

It's not much, but hey, it was easy and I knew how to do it :D

Anyway, here is the latest trace I ran. (a lot is left out, this is just the top 90% of cpu time)

Code:
CPU TIME (ms) BEGIN (total = 2239712) Fri May 20 19:40:42 2011
rank   self  accum   count trace method
   1 17.47% 17.47%      21 306858 java.lang.Object.wait
   2 17.46% 34.94%     828 306869 java.lang.ref.ReferenceQueue.remove
   3 16.52% 51.46%      16 319564 sun.net.www.http.KeepAliveCache.run
   4 15.74% 67.20% 7513448 319281 java.nio.DirectByteBuffer.getInt
   5  4.05% 71.25%     210 318093 java.net.SocketInputStream.read
   6  2.81% 74.05%   29347 319369 org.lwjgl.opencl.CL10.clEnqueueReadBuffer
   7  2.70% 76.75% 7513448 319278 java.nio.Buffer.checkIndex
   8  2.69% 79.44% 7513448 319279 java.nio.DirectByteBuffer.ix
   9  2.64% 82.08% 7513448 319280 java.nio.DirectByteBuffer.getInt
  10  1.80% 83.88%  675014 319312 org.lwjgl.opencl.CL10.clSetKernelArg
  11  1.36% 85.24%  675014 319313 org.lwjgl.opencl.InfoUtilFactory$CLKernelUtil.setArg
  12  1.01% 86.25%  675015 319298 java.lang.ThreadLocal.get
  13  1.00% 87.26%  675016 311203 java.lang.ThreadLocal$ThreadLocalMap.getEntry
  14  0.98% 88.24%  675015 319302 java.nio.DirectIntBufferU.put
  15  0.68% 88.92%   29348 319351 org.lwjgl.opencl.CL10.clEnqueueNDRangeKernel
  16  0.63% 89.55%  675015 319307 org.lwjgl.PointerWrapperAbstract.getPointer
  17  0.63% 90.18%  675012 319315 java.lang.ThreadLocal$ThreadLocalMap.access$000
  18  0.62% 90.80%  675015 319305 org.lwjgl.BufferChecks.checkBufferSize

Now I've started looking at some of the bigger stuff.  The first 2 lines are from the garbage collector, so you can see that ~35% of the CPU time was spent on just garbage collecting, 17% of which was spent just blocking all the execution threads in order to do so.  So I'm trying to figure out ways to improve that.

I don't really think that the netcode can be much faster, but another ~20% of cpu time is spent on that.  So if the netcode can be improved, that will get us back into the kernel faster.  The third line there is the thread that is used for keeping the HTTP 1.1 session alive.  I don't know much about that, but maybe it's a lead.
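For anyone checking the percentages against the trace, the two figures are just sums of the "self" column: rows 1 and 2 for the ~35%, rows 3 and 5 for the ~20%. The sketch below only redoes that arithmetic. Attributing those rows to garbage collection or to netcode is the reading given in this post, and `TraceShare` is a hypothetical helper name.

```java
// Hypothetical helper: adds up "self" percentages copied from the trace above.
final class TraceShare {
    // Sums any number of self-time percentages.
    static double sum(double... selfPercents) {
        double total = 0.0;
        for (double p : selfPercents) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        // Rows 1 and 2: Object.wait and ReferenceQueue.remove.
        double firstTwoRows = sum(17.47, 17.46);
        // Rows 3 and 5: KeepAliveCache.run and SocketInputStream.read.
        double networkRows = sum(16.52, 4.05);
        System.out.printf("rows 1-2: %.2f%%%n", firstTwoRows);
        System.out.printf("rows 3+5: %.2f%%%n", networkRows);
    }
}
```

The trace's own "accum" column reads 34.94% after row 2, presumably because it is computed from unrounded values.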

Anyway, I'm done for now.  Here is the new DiabloMiner.java with the new string builder.

Also:
So... ymm so fucking v.

I totally agree with that, but I love your code and bitcoin is fascinating.  So digging through this code is a great joy for me!  Great work so far man, and in Java too!  ;D
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
May 20, 2011, 06:28:11 AM
I got the profiler working... that was a lot easier than I thought it would be.  I haven't done too much Java outside of Google's DalvikVM, but it's not a true Java implementation so some things are done a little bit differently.

Anyway, I'm running the first batch of samples now :)

Are you going to be modifying the kernel much?  I'm curious as to how phatk reduced the operation count by that amount...

I did a lot of examining of phatk. I can't tell where he thinks he's saving cycles. Not only that, it runs exactly the same on SDK 2.1 and SDK 2.4 on my 5850 vs phoenix's standard kernel. Plus, if he is in fact exploiting anything, it probably isn't exploiting it as much as -v 3 -w 128 on mine on 69xx.

The key difference is not in the total number of instructions executed, but that they make better use of the 5-wide ALU design. Have a look at the ASM generated with AMD's KernelAnalyzer. Particularly the number of ALU ops. It's no faster than the poclbm kernel on 2.1, but for most people it eliminates the speed disadvantage of SDK 2.4.

It's also designed with VLIW5 in mind, so it's obviously not going to be optimal on VLIW4 hardware.

Well the big problem is on 2.4 phoenix-poclbm and phatk give near identical results... and both are still slower than real poclbm on both 2.1 and 2.4. And -v 18 and 19 give interesting results on 58xx on 2.4 which beats phatk's lackluster speed.

So... ymm so fucking v.
full member
Activity: 219
Merit: 120
May 20, 2011, 03:07:09 AM
I got the profiler working... that was a lot easier than I thought it would be.  I haven't done too much Java outside of Google's DalvikVM, but it's not a true Java implementation so some things are done a little bit differently.

Anyway, I'm running the first batch of samples now :)

Are you going to be modifying the kernel much?  I'm curious as to how phatk reduced the operation count by that amount...

I did a lot of examining of phatk. I can't tell where he thinks he's saving cycles. Not only that, it runs exactly the same on SDK 2.1 and SDK 2.4 on my 5850 vs phoenix's standard kernel. Plus, if he is in fact exploiting anything, it probably isn't exploiting it as much as -v 3 -w 128 on mine on 69xx.

The key difference is not in the total number of instructions executed, but that they make better use of the 5-wide ALU design. Have a look at the ASM generated with AMD's KernelAnalyzer. Particularly the number of ALU ops. It's no faster than the poclbm kernel on 2.1, but for most people it eliminates the speed disadvantage of SDK 2.4.

It's also designed with VLIW5 in mind, so it's obviously not going to be optimal on VLIW4 hardware.
full member
Activity: 126
Merit: 100
May 20, 2011, 02:54:18 AM
two 5870s, CC 11.5, SDK 2.1, on Debian testing.

i don't know yet how much faster it is than your pre-BFI_INT release.

but a lot.

i'm putting in some extra fans and a rheostatic fan speed controller - it's so damn fast that i have to clock it down right now to keep temps under 85.

so going from the old version, max volted at 300 MemClock and 900 GPUClock, to the new version down-volted by almost 0.2, MemClock at 315 and GPUClock at 850; i picked up a bit over 100 Mh/s.

i'll have the new fans and controller in tomorrow.  i have another box that i've experimented with fans on - just a single 5870, but i've learned a bit.  i'm hoping for a maxed-out setup on the dual box, running at well under 75 degrees.  we'll see.

At stock 850, 2 5870 should be in the neighborhood of 740 using -v 2 -w 128 on SDK 2.1.

BFI_INT adds around 10%.

pretty much.

i'm getting 746-748.

i'm hoping that once i get the voltage back up, and the GPUClock at 900 again, i'll be somewhere considerably closer to 800Mh/s.

by the way, Diablo - do you agree with the formula (picked up somewhere on this forum...) that the sweet spot for MemClocks is very close to:

GPUClock/3 + 14

?
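To make the question concrete, here is the rule of thumb evaluated at the two core clocks mentioned in this post. It is only the formula as quoted, with clocks assumed to be in MHz; nothing in the thread confirms that it actually gives the best memory clock, and `MemClockRule` is a hypothetical name.

```java
// Hypothetical helper: evaluates the forum rule of thumb MemClock = GPUClock/3 + 14.
final class MemClockRule {
    // Suggested memory clock for a given core clock (both assumed to be in MHz).
    static double suggestedMemClock(double gpuClockMhz) {
        return gpuClockMhz / 3.0 + 14.0;
    }

    public static void main(String[] args) {
        // The two core clocks mentioned in the post.
        for (double gpuClock : new double[] {850.0, 900.0}) {
            System.out.printf("GPUClock %.0f -> MemClock %.1f%n",
                    gpuClock, suggestedMemClock(gpuClock));
        }
    }
}
```

For comparison, the settings reported above were a MemClock of 300 at GPUClock 900 and 315 at GPUClock 850.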
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
May 20, 2011, 02:40:28 AM
Update: Added all of Dustin's suggestions, and also added a timeout for non-LP connections.
newbie
Activity: 14
Merit: 0
May 20, 2011, 02:12:59 AM
I completely agree with you... I've looked at the code for both and it's almost line for line exactly the same...

I tried looking for other SHA256 algorithms, just in case anybody had come up with something clever besides the norm, but there's nothing out there really... in the cpu world Crypto++ is king and that's pretty much it..
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
May 20, 2011, 02:10:24 AM
I got the profiler working... that was a lot easier than I thought it would be.  I haven't done too much Java outside of Google's DalvikVM, but it's not a true Java implementation so some things are done a little bit differently.

Anyway, I'm running the first batch of samples now :)

Are you going to be modifying the kernel much?  I'm curious as to how phatk reduced the operation count by that amount...

I did a lot of examining of phatk. I can't tell where he thinks he's saving cycles. Not only that, it runs exactly the same on SDK 2.1 and SDK 2.4 on my 5850 vs phoenix's standard kernel. Plus, if he is in fact exploiting anything, it probably isn't exploiting it as much as -v 3 -w 128 on mine on 69xx.
newbie
Activity: 14
Merit: 0
May 20, 2011, 02:05:41 AM
I got the profiler working... that was a lot easier than I thought it would be.  I haven't done too much Java outside of Google's DalvikVM, but it's not a true Java implementation so some things are done a little bit differently.

Anyway, I'm running the first batch of samples now :)

Are you going to be modifying the kernel much?  I'm curious as to how phatk reduced the operation count by that amount...
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
May 20, 2011, 02:02:19 AM
two 5870s, CC 11.5, SDK 2.1, on Debian testing.

i don't know yet how much faster it is than your pre-BFI_INT release.

but a lot.

i'm putting in some extra fans and a rheostatic fan speed controller - it's so damn fast that i have to clock it down right now to keep temps under 85.

so going from the old version, max volted at 300 MemClock and 900 GPUClock, to the new version down-volted by almost 0.2, MemClock at 315 and GPUClock at 850; i picked up a bit over 100 Mh/s.

i'll have the new fans and controller in tomorrow.  i have another box that i've experimented with fans on - just a single 5870, but i've learned a bit.  i'm hoping for a maxed-out setup on the dual box, running at well under 75 degrees.  we'll see.

At stock 850, 2 5870 should be in the neighborhood of 740 using -v 2 -w 128 on SDK 2.1.

BFI_INT adds around 10%.