From my first run of profiling the miner, I saw that you were spending about 2% of CPU time just building strings (mainly StringBuilder copying char arrays internally). The + operator gets compiled down to StringBuilder calls, which can be pretty slow. I ran into this in my game engine here at work and came across a post on StackOverflow from a guy who implements his own (albeit primitive) class for string concatenation.
I forgot to save the profile for that one (the profiler automatically overwrites the output file every time and I'm lazy), but I reduced the CPU time spent on string building from 2% to <= 0.01%.
It's not much, but hey, it was easy and I knew how to do it.
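For anyone curious what that looks like, here's a rough sketch of the idea (my own illustrative version, not the exact class from that StackOverflow post or the one I dropped into DiabloMiner.java): keep one reusable char[] around and append straight into it, so repeated concatenation doesn't churn out intermediate StringBuilder/String garbage.
[code]
// Illustrative only -- class and method names are mine, sizes are arbitrary.
final class QuickStringBuilder {
    private char[] buf;
    private int len;

    QuickStringBuilder(int capacity) {
        buf = new char[capacity];
    }

    QuickStringBuilder append(String s) {
        int n = s.length();
        if (len + n > buf.length) {
            // Grow geometrically so repeated appends stay amortized O(1)
            char[] bigger = new char[Math.max(buf.length * 2, len + n)];
            System.arraycopy(buf, 0, bigger, 0, len);
            buf = bigger;
        }
        s.getChars(0, n, buf, len);
        len += n;
        return this;
    }

    void reset() {
        len = 0; // reuse the same backing array for the next message
    }

    @Override
    public String toString() {
        return new String(buf, 0, len);
    }
}
[/code]
The win comes from calling reset() and reusing the same instance instead of building a new StringBuilder for every message.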
Anyway, here is the latest trace I ran. (A lot is left out; this is just the top 90% of CPU time.)
CPU TIME (ms) BEGIN (total = 2239712) Fri May 20 19:40:42 2011
rank self accum count trace method
1 17.47% 17.47% 21 306858 java.lang.Object.wait
2 17.46% 34.94% 828 306869 java.lang.ref.ReferenceQueue.remove
3 16.52% 51.46% 16 319564 sun.net.www.http.KeepAliveCache.run
4 15.74% 67.20% 7513448 319281 java.nio.DirectByteBuffer.getInt
5 4.05% 71.25% 210 318093 java.net.SocketInputStream.read
6 2.81% 74.05% 29347 319369 org.lwjgl.opencl.CL10.clEnqueueReadBuffer
7 2.70% 76.75% 7513448 319278 java.nio.Buffer.checkIndex
8 2.69% 79.44% 7513448 319279 java.nio.DirectByteBuffer.ix
9 2.64% 82.08% 7513448 319280 java.nio.DirectByteBuffer.getInt
10 1.80% 83.88% 675014 319312 org.lwjgl.opencl.CL10.clSetKernelArg
11 1.36% 85.24% 675014 319313 org.lwjgl.opencl.InfoUtilFactory$CLKernelUtil.setArg
12 1.01% 86.25% 675015 319298 java.lang.ThreadLocal.get
13 1.00% 87.26% 675016 311203 java.lang.ThreadLocal$ThreadLocalMap.getEntry
14 0.98% 88.24% 675015 319302 java.nio.DirectIntBufferU.put
15 0.68% 88.92% 29348 319351 org.lwjgl.opencl.CL10.clEnqueueNDRangeKernel
16 0.63% 89.55% 675015 319307 org.lwjgl.PointerWrapperAbstract.getPointer
17 0.63% 90.18% 675012 319315 java.lang.ThreadLocal$ThreadLocalMap.access$000
18 0.62% 90.80% 675015 319305 org.lwjgl.BufferChecks.checkBufferSize
Now I've started looking at some of the bigger stuff. The first two entries (Object.wait and ReferenceQueue.remove) are from the garbage collector, so you can see that ~35% of the CPU time was spent just on garbage collection, 17% of which went to blocking all the execution threads in order to do so. So I'm trying to figure out ways to improve that.
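I haven't figured out yet which allocations are actually feeding the collector, so take this as the generic shape of the fix rather than a patch: hoist per-iteration allocations out of the hot loop and reuse them. Something roughly like this (illustrative only, not DiabloMiner's actual code):
[code]
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical example: allocate once per worker thread and reuse it every pass,
// instead of creating a fresh buffer (and future garbage) for every kernel launch.
final class OutputBufferHolder {
    private final ByteBuffer output;

    OutputBufferHolder(int bytes) {
        output = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
    }

    ByteBuffer prepare() {
        output.clear(); // reset position/limit; no new allocation, nothing for the GC
        return output;
    }
}
[/code]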
I don't really think the netcode can be made much faster, but another ~20% of CPU time is spent there. So if the netcode can be improved, that will get us back into the kernel faster. The third entry is the thread used for keeping the HTTP 1.1 session alive; I don't know much about that, but maybe it's a lead.
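If that keep-alive thread does turn out to matter, the stock HttpURLConnection client at least exposes a couple of standard system properties to play with. This is only a pointer for experimenting; I haven't measured whether they change anything here:
[code]
// Standard knobs for Sun's built-in HTTP client -- must be set before the first request.
// Whether either one actually reduces KeepAliveCache overhead in the miner is untested.
public final class KeepAliveTuning {
    public static void apply() {
        System.setProperty("http.keepAlive", "true");    // reuse persistent HTTP 1.1 connections
        System.setProperty("http.maxConnections", "5");  // idle connections cached per destination
    }
}
[/code]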
Anyway, I'm done for now.
Here is the new DiabloMiner.java with the new string builder.
Also:
[quote author=DiabloD3 link=topic=1721.msg131499#msg131499 date=1305890891]
[quote author=jedi95 link=topic=1721.msg131287#msg131287 date=1305878829]
[quote author=DiabloD3 link=topic=1721.msg131220#msg131220 date=1305875424]
[quote author=DustinEwan link=topic=1721.msg131215#msg131215 date=1305875141]
I got the profiler working... that was a lot easier than I thought it would be. I haven't done too much Java outside of Google's DalvikVM, but it's not a true Java implementation, so some things are done a little differently.
Anyway, I'm running the first batch of samples now :)
Are you going to be modifying the kernel much? I'm curious as to how phatk reduced the operation count by that amount...
[/quote]
I did a lot of examining of phatk. I can't tell where he thinks he's saving cycles. Not only that, it runs exactly the same as phoenix's standard kernel on SDK 2.1 and SDK 2.4 on my 5850. Plus, if he is in fact exploiting anything, it probably isn't being exploited as much as -v 3 -w 128 on mine on 69xx.
[/quote]
The key difference is not in the total number of instructions executed, but that they make better use of the 5-wide ALU design. Have a look at the ASM generated with AMD's KernelAnalyzer, particularly the number of ALU ops. It's no faster than the poclbm kernel on 2.1, but for most people it eliminates the speed disadvantage of SDK 2.4.
It's also designed with VLIW5 in mind, so it's obviously not going to be optimal on VLIW4 hardware.
[/quote]
Well, the big problem is that on 2.4 phoenix-poclbm and phatk give near-identical results... and both are still slower than real poclbm on both 2.1 and 2.4. And -v 18 and 19 give interesting results on 58xx on 2.4, which beats phatk's lackluster speed.
So... ymm so fucking v.
[/quote]
I totally agree with that, but I love your code and Bitcoin is fascinating, so digging through this code is a great joy for me! Great work so far man, and in Java too!