Topic: Large Bitcoin Collider (Collision Finders Pool) - page 37. (Read 193484 times)

full member
Activity: 177
Merit: 101
Before i go nuts and try to reach top30, is there a guesstimate for how a GTX 750TI will perform with LBC ?

Will it make sense to try and enter top30, or to shell out 0.1 BTC ?

My current setup is a i7-4770, 16GB RAM and a Palit GTX 750TI.

At the moment this configuration will give you ~ 3 times the performance of what you get now with the CPU-only generator.

My suggestion would be to shell out 0.01 BTC for some AWS code ($20 or more) and to throw some AWS compute instance on the "top30 problem".
As of now, this is still possible. If 10 people do it, it may no longer be.

Oh and because the question has come up: Once in top30 - always GPU-authorized=yes, the authorization will not go away should you fall out of the top30 again.

Rico


Thanks for answering, but i'll stick to CPU-only for now. Not willing to spend any $ or BTC for less than 10x speed gain.

CPU-only performance is quite awesome, one core on i7-4770 does 720000 keys/second.
legendary
Activity: 1120
Merit: 1037
฿ → ∞
Before i go nuts and try to reach top30, is there a guesstimate for how a GTX 750TI will perform with LBC ?

Will it make sense to try and enter top30, or to shell out 0.1 BTC ?

My current setup is a i7-4770, 16GB RAM and a Palit GTX 750TI.

At the moment this configuration will give you ~ 3 times the performance of what you get now with the CPU-only generator.

My suggestion would be to shell out 0.01 BTC for some AWS code ($20 or more) and to throw some AWS compute instance on the "top30 problem".
As of now, this is still possible. If 10 people do it, it may no longer be.

Oh and because the question has come up: Once in top30 - always GPU-authorized=yes, the authorization will not go away should you fall out of the top30 again.

Rico
full member
Activity: 177
Merit: 101
Before i go nuts and try to reach top30, is there a guesstimate for how a GTX 750TI will perform with LBC ?

Will it make sense to try and enter top30, or to shell out 0.1 BTC ?

My current setup is a i7-4770, 16GB RAM and a Palit GTX 750TI.

thanks
legendary
Activity: 1120
Merit: 1037
฿ → ∞
So you're saying we are too slow? You are right, but it's not very motivational.

from this

Dual Intel Xeon E5-2690 v3 (2.60GHz) 24 Cores
64GB RAM
NVIDIA Tesla K80
GPU: 2 x Kepler GK210

with vanitygen I was getting around 150Mkeys ...

Which were only hashed to compressed addresses. Plus it does not check 9 M addresses for each generated key(!).
We're doing both uncompressed and compressed, so to be fair, when LBC shows 75 Mkeys/s on this configuration, it will technically be as fast as oclvanitygen, but doing more work.

Our problem is still the ECC which happens on the CPU. Right now, we have a CPU/GPU hybrid. That is

  • The CPU computes 4096 uncompressed public keys and moves them to GPU
  • The GPU computes 4096 hash160 of this and 4096 hash160 of the compressed equivalents
  • The 8192 hashes are moved back to the CPU which performs a bloom filter search on them.

This process is done 4096 times before you see an 'o' on your screen. The bloom checking is negligible and the CPU could easily keep up with the GPU here.
The ECC is the problem. Of the 7.5 seconds for the 16Mkeys on my computer, 6.2 seconds are ECC.
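To put numbers on the batch sizes and timings above, here is a rough back-of-the-envelope sketch in Python. It assumes the 7.5 s and 6.2 s figures refer to one generator process and that the non-ECC time stays constant when ECC gets faster; the function names are made up for illustration.

```python
BATCH = 4096          # public keys per CPU -> GPU transfer
BATCHES_PER_O = 4096  # transfers before one 'o' is printed


def keys_per_o() -> int:
    """Private keys covered by a single 'o' on the screen."""
    return BATCH * BATCHES_PER_O


def hashes_per_o() -> int:
    """hash160 values per 'o': uncompressed plus compressed for each key."""
    return 2 * keys_per_o()


def rate_mkeys(total_seconds: float) -> float:
    """Key rate in Mkeys/s if one 'o' takes total_seconds."""
    return keys_per_o() / total_seconds / 1e6


def projected_rate(total_s: float, ecc_s: float, new_ecc_s: float) -> float:
    """Key rate if only the ECC share of the time changes (assumption)."""
    return rate_mkeys(total_s - ecc_s + new_ecc_s)


if __name__ == "__main__":
    print(keys_per_o(), "keys and", hashes_per_o(), "hashes per 'o'")
    print(f"now:       {rate_mkeys(7.5):.2f} Mkeys/s per process")
    print(f"ECC at 1s: {projected_rate(7.5, 6.2, 1.0):.2f} Mkeys/s per process")
```

Under these assumptions the non-ECC part is only 1.3 s of the 7.5 s, which is why shrinking the ECC time dominates any other optimization.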

I'm working on it, and we will see (tremendous) speedups again in the future. Until then the best motivation to make it happen is basically

"Yay! We are faster! Cheers!"  - after a 300% speedup (of Go generator, which was some 1000 times faster than wget/100x faster than vanitygen parsing)
"Yay! We are faster! Cheers!"  - after a 1300% speedup (by using brainflayer)
"Yay! We are faster! Cheers!"  - after a 50% speedup by optimizing/rewriting the brainflayer code for almost 3 months.
"Yay! We are faster! Cheers!"  - after a 250% speedup by using the GPU as hash160 coprocessor
"Yay! We are faster! Cheers!"  - after a 20% speedup by optimizing the CPU/GPU hybrid more

=> We are today about 150x faster than the 1st LBC generator in July 2016; my notebook alone delivers 25x the keyrate of the whole pool upon inception.
=> We are on our way to a GPU generator only or something well balanced using 100% of the GPU (and efficiently)
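As a rough cross-check of the "about 150x" figure, the individual steps in the list can be multiplied. This sketch assumes each "N% speedup" means "N% of the previous speed" (so 300% is a factor of 3); that reading is an interpretation, not something the list states, and the labels are shorthand.

```python
from math import prod

# Speedup steps from the list, read as plain factors (assumption).
STEPS = {
    "Go generator": 3.0,
    "brainflayer": 13.0,
    "brainflayer rewrite": 1.5,
    "GPU as hash160 coprocessor": 2.5,
    "CPU/GPU hybrid tuning": 1.2,
}


def cumulative_speedup(factors) -> float:
    """Overall speedup when the individual steps are applied in sequence."""
    return prod(factors)


if __name__ == "__main__":
    total = cumulative_speedup(STEPS.values())
    print(f"cumulative speedup: about {total:.0f}x")
```

With this reading the product lands in the same ballpark as the quoted 150x; reading the percentages as "N% on top of the previous speed" would give a noticeably larger number.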

So if we get ECC from 6.2s to - say - 1s, the configuration above will make around 90 Mkeys/s
I'm quite confident that, based on arulbero's work and research, we can move quite a bit towards this goal.
Until then, what we've got is the best we've got.


Rico
member
Activity: 62
Merit: 10
from this

Dual Intel Xeon E5-2690 v3 (2.60GHz) 24 Cores
64GB RAM
NVIDIA Tesla K80
GPU: 2 x Kepler GK210

with vanitygen I was getting around 150Mkeys ...
legendary
Activity: 1120
Merit: 1037
฿ → ∞
Good morning.

Yesterday the LBC project moved quite a bit forward. I worked hard all day to test and code and so finally there is a GPU client which will be available to eligible users soon. Very soon.

@SlarkBoy
Quote
OpenCL diagnostics written.
GPU authorized: yes
Will use 4 CPUs.
Best generator chosen: gen-hrdcore-gpu-linux64
New generator found. (DL-size: 0.72MB)
Benchmark info not found - benchmarking... ./gen-hrdcore-gpu-linux64: /usr/local/cuda-8.0/targets/x86_64-linux/lib/libOpenCL.so.1: no version information available (required by ./gen-hrdcore-gpu-linux64)
Couldn't find the program file: No such file or directory
done.
Your maximum speed is 89335072 keys/s per CPU core.
Ask for work... Server doesn't like us. Answer: toofast.

Your client probably updated already to 1.015 - this is the 1st version to choose a GPU-assisted generator if all the prerequisites for it are met:

  • You put the --gpu flag on command line (if you don't it will still use the regular CPU generator)
  • You are in the top30 or have the GPU-eligible flag set
  • You have an AVX2 capable CPU
  • Your OpenCL environment is installed
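The prerequisite list above boils down to a single boolean condition. A minimal sketch in Python, with all names (`ClientState`, `use_gpu_generator`) invented for illustration; the real client's internals are not shown in this thread.

```python
from dataclasses import dataclass


@dataclass
class ClientState:
    """Hypothetical snapshot of what the client knows about itself."""
    gpu_flag: bool          # --gpu given on the command line
    in_top30: bool          # user is in the pool's top30
    gpu_eligible: bool      # GPU-eligible flag is set for the user
    has_avx2: bool          # CPU is AVX2 capable
    opencl_installed: bool  # an OpenCL environment is installed


def use_gpu_generator(state: ClientState) -> bool:
    """True only if every prerequisite from the list is met."""
    authorized = state.in_top30 or state.gpu_eligible
    return (state.gpu_flag and authorized
            and state.has_avx2 and state.opencl_installed)


if __name__ == "__main__":
    me = ClientState(gpu_flag=True, in_top30=False, gpu_eligible=True,
                     has_avx2=True, opencl_installed=True)
    print("GPU-assisted generator" if use_gpu_generator(me)
          else "regular CPU generator")
```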

There are still some things missing (like the OpenCL source code - that's why you see that error). After working almost 16 hours straight yesterday, I had to stop at some point.
I intend to have it all working this weekend.

Good news for the client is, the race condition is gone and it can now handle multiple GPUs in a system.

Bad news for those using AMD GPUs is: The client will only look for Nvidia hardware. There is no technical reason for this. It's just that nobody with AMD hardware sent me a diagnostics file and I will not enable AMD support if untested. I have an AMD GPU machine here myself, but it's Windows-only and I have to install Linux on it 1st. After I have done and tested that, I will enable it.

@CjMapope
Quote
edit:  it ran for a while then said  " so you want to play hard, sucker? yes, ok .. bye" and died.  man i love this server hahaha.
error must be on my end i think ;p maybe an update

What you observed is the 2nd line of defense the client has in place to cope with code tampering. Normally it computes a checksum of its own source code and sends that to the server, which has a database of which version has which checksum. If you tamper with the code, it will simply say so and block communication. Now if you dig deeper and change the code that provides the checksum, you have still tampered with the code, but the client sends the "correct" checksum to the server. There is a 2nd mechanism in place to catch that, and that's what you have seen. Please do not change the code of the client - it's really not worth it.
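The first line of defense described above (checksum of the client source, looked up per version on the server) can be sketched as follows. This is only an illustration: SHA-256, the function names and the in-memory "database" are assumptions, and the 2nd mechanism is not described in the post, so it is not sketched here.

```python
import hashlib


def source_checksum(source: bytes) -> str:
    """Checksum the client would compute over its own source (SHA-256 assumed)."""
    return hashlib.sha256(source).hexdigest()


def server_accepts(version: str, checksum: str, known: dict) -> bool:
    """Server side: compare the reported checksum with the entry for that version."""
    return known.get(version) == checksum


if __name__ == "__main__":
    original = b"print('pretend this is the client source')\n"
    known_checksums = {"1.015": source_checksum(original)}  # hypothetical table

    tampered = original + b"# local patch\n"
    print(server_accepts("1.015", source_checksum(original), known_checksums))
    print(server_accepts("1.015", source_checksum(tampered), known_checksums))
```

A check like this only catches naive edits, since a modified client can report the old checksum; that is the gap the second mechanism is said to cover.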


@unknownhostname
Quote
whats the Key rate for a p2.16xlarge ?

I managed to get 22 Mkeys/s from a p2.8xlarge - for a while, after having worked hard to put the p2.8xlarge on life support. And it crashed eventually again. Really - these machines are utter shit. $2 per hour? Bah. And for the regions I have looked up, Amazon wants $144 per hour for the p2.16xlarge. Srsly? In a perfect world you should get 44 Mkeys/s from a p2.16xlarge.
As I said, the best AWS machine for LBC is currently still the m4.16xlarge which gives you 18 Mkeys/s for $0.4 per hour.

Quote
As well for :
Dual Intel Xeon E5-2690 v3 (2.60GHz) 24 Cores
64GB RAM
NVIDIA Tesla K80
GPU: 2 x Kepler GK210

That looks way better.  My estimate is 2.5 to 3 times the speed you get from the CPU client on that machine. Should be 25 to 30 Mkeys/s.

Quote
..24 Haswell cores +
NVIDIA Tesla M60
GPU: 2 x Maxwell GM204

About the same speed, maybe slightly faster, but the GPUs being less under load. The CPUs are still a limiting factor here.



Rico
member
Activity: 114
Merit: 11
OpenCL diagnostics written.
GPU authorized: yes
Will use 4 CPUs.
Best generator chosen: gen-hrdcore-gpu-linux64
New generator found. (DL-size: 0.72MB)
Benchmark info not found - benchmarking... ./gen-hrdcore-gpu-linux64: /usr/local/cuda-8.0/targets/x86_64-linux/lib/libOpenCL.so.1: no version information available (required by ./gen-hrdcore-gpu-linux64)
Couldn't find the program file: No such file or directory
done.
Your maximum speed is 89335072 keys/s per CPU core.
Ask for work... Server doesn't like us. Answer: toofast.


next run:
OpenCL diagnostics written.
GPU authorized: yes
Will use 4 CPUs.
Best generator chosen: gen-hrdcore-gpu-linux64
Ask for work... Server doesn't like us. Answer: toofast.


wow 80 Mkeys/s ?
legendary
Activity: 1820
Merit: 1092
~Full-Time Minter since 2016~
ASK FOR WORK.... DEATH KISS

??

(Searched the thread and site couldn't find a previous example of this)

edit:  it ran for a while then said  " so you want to play hard, sucker? yes, ok .. bye" and died.  man i love this server hahaha.
error must be on my end i think ;p maybe an update
edit 2: i fixed it, apparently the client self destructed (due to my "death wish"?) so i just remade the whole thing, im back colliding!
member
Activity: 62
Merit: 10
whats the Key rate for a p2.16xlarge ?

As well for :

 Dual Intel Xeon E5-2690 v3 (2.60GHz)
24 Cores
64GB RAM
NVIDIA Tesla K80
GPU: 2 x Kepler GK210
Memory: 24GB GDDR5
Clock Speed: 2.5 GHz
NVIDIA CUDA Cores: 2 x 2496
Memory Bandwidth:
 2 x 240GB/sec


and same with


NVIDIA Tesla M60
GPU: 2 x Maxwell GM204
Memory: 16GB GDDR5
Clock Speed: 2.5 GHz
NVIDIA CUDA Cores: 2 x 2048
Memory Bandwidth:
 2 x 160GB/sec
legendary
Activity: 1120
Merit: 1037
฿ → ∞
So it seems I finally managed to eliminate the race condition (clFlush and clFinish are the OpenCL programmer's friends)
LBC is more stable than ever!

Then I thought: "Hey! Why not make the GPU device choice a CLI parameter?" So I managed to
start 4 LBC instances, each taking 8 CPUs and a different GPU on p2.8xlarge.

Code:
ubuntu@ip-172-31-32-72:~/collider$ ./LBC -c 8 -t 1 -gdev 1
Ask for work... got blocks [406251993-406252760] (805 Mkeys)
...next window...
ubuntu@ip-172-31-32-72:~/collider$ ./LBC -c 8 -t 1 -gdev 2
Ask for work... got blocks [406253049-406253816] (805 Mkeys)
...next window...
ubuntu@ip-172-31-32-72:~/collider$ ./LBC -c 8 -t 1 -gdev 3
Ask for work... got blocks [406253817-406254584] (805 Mkeys)
...next window...
ubuntu@ip-172-31-32-72:~/collider$ ./LBC -c 8 -t 1 -gdev 4
Ask for work... got blocks [406254585-406255352] (805 Mkeys)

Theoretically, this should give me 32 Mkeys/s (edit: actually a p2.8xlarge gives right now 22 Mkeys/s) but after 20 seconds:



LBC vs. AWS 1:0
ok, reboot and 2nd try



LBC vs. AWS 2:0

Code:
top - 16:34:57 up 5 min,  5 users,  load average: 924.34, 384.11, 144.98

At the moment I have no evidence this would be some software fault on LBC's side.

Yeah - if you give the instance time (slowly ramp up work), and install LBC in the ramdisk(!), then you can manage to have a working multi-GPU instance.

Code:
ubuntu@ip-172-31-32-72:~/collider$ nvidia-smi
Thu Feb 16 17:54:20 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:17.0     Off |                    0 |
| N/A   68C    P0    84W / 149W |    513MiB / 11439MiB |     42%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:00:18.0     Off |                    0 |
| N/A   55C    P0    92W / 149W |    513MiB / 11439MiB |     40%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:00:19.0     Off |                    0 |
| N/A   71C    P0    78W / 149W |    513MiB / 11439MiB |     42%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:00:1A.0     Off |                    0 |
| N/A   55C    P0    87W / 149W |    513MiB / 11439MiB |     43%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:00:1B.0     Off |                    0 |
| N/A   42C    P8    26W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:00:1C.0     Off |                    0 |
| N/A   36C    P8    31W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:00:1D.0     Off |                    0 |
| N/A   40C    P8    26W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   34C    P8    30W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     26712    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    0     26713    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    0     26730    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    0     26732    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    0     26733    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    0     26734    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    0     26735    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    0     26746    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    1     26738    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    1     26739    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    1     26749    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    1     26750    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    1     26807    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    1     26815    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    1     26823    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    1     26831    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    2     26586    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    2     26588    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    2     26589    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    2     26590    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    2     26591    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    2     26616    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    2     26617    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    2     26618    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    3     26599    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    3     26601    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    3     26603    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    3     26606    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    3     26607    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    3     26608    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    3     26609    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    3     26610    C   ./gen-hrdcore-avx2-linux64                      64MiB |
+-----------------------------------------------------------------------------+

I'm seriously thinking about offering pre-installed LBC clients. This AWS crap is unbearable.

Rico
legendary
Activity: 1120
Merit: 1037
฿ → ∞
As estimated, 10 cores can saturate 1 K80

Code:
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 0000:00:1E.0     Off |                    0 |
| N/A   48C    P0   104W / 149W |    641MiB / 11439MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    7      2219    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    7      2221    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    7      2223    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    7      2225    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    7      2228    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    7      2229    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    7      2230    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    7      2232    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    7      2233    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    7      2234    C   ./gen-hrdcore-avx2-linux64                      64MiB |
+-----------------------------------------------------------------------------+

resulting in about 12 Mkeys/s

Code:
ubuntu@ip-172-31-32-72:~/collider$ ./LBC -c 10 -t 1 -l 0
Ask for work... got blocks [405667481-405668440] (1006 Mkeys)
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo (12.07 Mkeys/s)

So a p2.8xlarge will give you around 8 times the keyrate of the p2.xlarge for - at least - 8 times the price of the p2.xlarge.
So not entirely satisfied...

Rico

legendary
Activity: 1120
Merit: 1037
฿ → ∞

economic considerations:

Really? You've finally decided this "project" needs some economic considerations after 23 pages of enthusiastic code churning?

becoin - as always... It's not the "project" that needs economic considerations, but anyone who wants to get in the top30 for getting a GPU client and not forking out 0.1 BTC (or 0.5 BTC if he's becoin).

Right now, you can still get in the top30 for around $11 (~28 hours) with an m4.16xlarge AWS spot instance. To achieve the same with the p2.xlarge would cost you $33.


Apropos churning:

I made a workaround in the LBC client to stop the generator when it is churning bad hashes:

Code:
Ask for work... got blocks [405316777-405317288] (536 Mkeys)
oooooooooooooooooooooooooooooooo (6.68 Mkeys/s)
Ask for work... got blocks [405317817-405318328] (536 Mkeys)
oooooooooooooooooooooooooooooooo (6.51 Mkeys/s)
Ask for work... got blocks [405318361-405318872] (536 Mkeys)
ooooooooooooooooooGenerator churning bad hits! Abort.
20 just got out of the pool with exit code: 255 and data:
ooooooooooooomalformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "HASH(0x3e5cca8)") at ./LBC line 1176.

It's not nice, but until I find a real fix, this at least prevents flawed PoW proliferating into the done blocks.
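A guard of this kind could look roughly like the sketch below. It assumes the generator prints hits as `hash160:u|c:priv:base + offset`, that the private key is base plus offset, and that an all-zero hash160 is the sign of a bad hit; the actual criterion used by the client is not stated in this thread, and all names are hypothetical.

```python
import re

# One hit line, e.g. "<40 hex chars>:c:priv:<hex base> + 0xd2" (format assumed).
HIT_RE = re.compile(r"^([0-9a-f]{40}):([uc]):priv:([0-9a-f]+) \+ (0x[0-9a-f]+|\d+)$")
ZERO_HASH = "0" * 40


class BadHits(Exception):
    """Raised when the generator output looks corrupted."""


def parse_hit(line: str):
    """Return (hash160, 'u' or 'c', private key as int), or None if not a hit."""
    match = HIT_RE.match(line.strip())
    if match is None:
        return None
    hash160, kind, base, offset = match.groups()
    return hash160, kind, int(base, 16) + int(offset, 0)


def filter_hits(lines, max_bad: int = 1):
    """Collect plausible hits; abort once max_bad all-zero hashes were seen."""
    good, bad = [], 0
    for line in lines:
        hit = parse_hit(line)
        if hit is None:
            continue  # progress output such as 'oooo'
        if hit[0] == ZERO_HASH:
            bad += 1
            if bad >= max_bad:
                raise BadHits("Generator churning bad hits! Abort.")
        else:
            good.append(hit)
    return good
```

Aborting on the first zero hash is deliberately strict: a block that produced even one bogus hit should not be reported as done.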


Rico
legendary
Activity: 3431
Merit: 1233

economic considerations:


Really? You've finally decided this "project" needs some economic considerations after 23 pages of enthusiastic code churning?

legendary
Activity: 1120
Merit: 1037
฿ → ∞
https://twitter.com/LBC_collider

GPU's arent ... you should try GPU ... I'm sure you can deliver great speed with GPU

even with 1 server I think I can triple the pool speed.

root@soft:~# lshw -C video | grep product:
       product: ASPEED Graphics Family
       product: GK210GL [Tesla K80]
       product: GK210GL [Tesla K80]
       product: GK210GL [Tesla K80]
       product: GK210GL [Tesla K80]

Code:
ubuntu@ip-172-31-34-146:~/collider$ ./LBC -c 4 -l 0 -t 1
Benchmark info not found - benchmarking... done.
Your maximum speed is 1576126 keys/s per CPU core.
Ask for work... got blocks [405066137-405066520] (402 Mkeys)
oooooooooooooooooooooooo (3.19 Mkeys/s)
ubuntu@ip-172-31-34-146:~/collider$ ./LBC -c 2 -l 0 -t 1
Ask for work... got blocks [405077529-405077720] (201 Mkeys)
oooooooooooo (2.78 Mkeys/s)

Clearly, Amazon puts way too few/too weak CPUs in their instances - for our use case.
What surprises me more is that the K80 does not look so impressive compared with my tiny Notebook GPU:

Code:
ubuntu@ip-172-31-34-146:~$ nvidia-smi
Thu Feb 16 11:23:38 2017      
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:00:1E.0     Off |                    0 |
| N/A   55C    P0    76W / 149W |    256MiB / 11439MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1938    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    0      1939    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    0      1940    C   ./gen-hrdcore-avx2-linux64                      64MiB |
|    0      1941    C   ./gen-hrdcore-avx2-linux64                      64MiB |
+-----------------------------------------------------------------------------+


With the 4 vCPUs in use. Clearly, 4 vCPUs in Amazon speak mean 2 real cores + 2 HT

versus my real 4 CPUs:

Code:
$ nvidia-smi
Thu Feb 16 12:36:36 2017      
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M2000M       Off  | 0000:01:00.0     Off |                  N/A |
| N/A   51C    P0    N/A /  N/A |    115MiB /  4041MiB |     33%      Default |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     21809    C   ./gen-hrdcore-skylake-linux64                   28MiB |
|    0     21810    C   ./gen-hrdcore-skylake-linux64                   28MiB |
|    0     21811    C   ./gen-hrdcore-skylake-linux64                   28MiB |
|    0     21839    C   ./gen-hrdcore-skylake-linux64                   28MiB |
+-----------------------------------------------------------------------------+

I end up at almost 7 Mkeys/s with my 4 CPUs. Moreover, not only is the memory usage more efficient (ok - the K80 has 3 times the memory, but it also slurps - for reasons unknown to me - about 2.5 times per process), also the relative utilization is in favor of my notebook. If Amazon offered a P2 instance with 20vCPUs and 1 K80 -> that would be balanced and at least 30 Mkeys/s could be expected from that.
Also a good (in terms of balance) configuration: 12 real Skylake cores and some reasonable Maxwell (GM107) GPU -> should give you 23+ Mkeys/s

On the more positive side, GPU detection and choice of OpenCL device ran flawlessly on 1st try.


Rico

edit: installation howto for OpenCL on Ubuntu 16.04 (as used on AWS):

Code:
# OpenCL @ Amazon AWS Ubuntu ----------------------------------

sudo apt-get install gcc make tmux libssl-dev xdelta3 nvidia-367 nvidia-cuda-toolkit
mkdir collider; cd collider; tmux
wget ftp://ftp.cryptoguru.org/LBC/client/LBC
chmod a+x LBC
sudo ./LBC -h
sudo cpan
cpan> install OpenCL
sudo reboot
sudo nvidia-smi -pm 1
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac 2505,875
./LBC -x
./LBC --gpu

economic considerations:

At the moment AWS GPU instances are not economical. For $0.25/h you can get the p2.xlarge and it will give you max 3.2 Mkeys/s. OTOH, for $0.5/h you can get an m4.16xlarge compute instance with 64 vCPUs and that will give you around 18 Mkeys/s. Yes - we need a better GPU client.
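The comparison above is easy to redo as keys per dollar. A small sketch, assuming the quoted prices are US dollars per hour and the key rates are sustained; the function name is made up for illustration.

```python
def keys_per_dollar(mkeys_per_s: float, dollars_per_hour: float) -> float:
    """Keys checked for one dollar of instance time."""
    return mkeys_per_s * 1e6 * 3600 / dollars_per_hour


if __name__ == "__main__":
    p2 = keys_per_dollar(3.2, 0.25)  # p2.xlarge, GPU instance
    m4 = keys_per_dollar(18.0, 0.5)  # 64 vCPU compute instance
    print(f"p2.xlarge:        {p2 / 1e9:.1f} Gkeys per dollar")
    print(f"64 vCPU instance: {m4 / 1e9:.1f} Gkeys per dollar")
    print(f"ratio:            {m4 / p2:.2f}x in favor of the CPU instance")
```

With these inputs the CPU instance does a bit under three times the work per dollar.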
legendary
Activity: 1120
Merit: 1037
฿ → ∞
Seems I caught some race condition after my async modifications. When I came to my notebook this morning, I saw

Code:
...lots of work done, but then ...
Ask for work... got blocks [403243609-403246040] (2550 Mkeys)
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo0000000000000000000000000000000000000000:u:priv:000000000000000000000000000000000000000000000000000180909f801001 + 0
0000000000000000000000000000000000000000:c:priv:000000000000000000000000000000000000000000000000000180909f801001 + 0
0000000000000000000000000000000000000000:u:priv:000000000000000000000000000000000000000000000000000180909f801001 + 0x1
0000000000000000000000000000000000000000:c:priv:000000000000000000000000000000000000000000000000000180909f801001 + 0x1
0000000000000000000000000000000000000000:u:priv:000000000000000000000000000000000000000000000000000180909f801001 + 0x2
0000000000000000000000000000000000000000:c:priv:000000000000000000000000000000000000000000000000000180909f801001 + 0x2
0000000000000000000000000000000000000000:u:priv:000000000000000000000000000000000000000000000000000180909f801001 + 0x3
0000000000000000000000000000000000000000:c:priv:000000000000000000000000000000000000000000000000000180909f801001 + 0x3
0000000000000000000000000000000000000000:u:priv:000000000000000000000000000000000000000000000000000180909f801001 + 0x4
0000000000000000000000000000000000000000:c:priv:000000000000000000000000000000000000000000000000000180909f801001 + 0x4
0000000000000000000000000000000000000000:u:priv:000000000000000000000000000000000000000000000000000180909f801001 + 0x5
...and so on...

thousands of "finds" of a 000000 hash160. And then

Code:
197f1706f2aa45480c1debc40628c87823da08f6:c:priv:000000000000000000000000000000000000000000000000000180909f801001 + 0xd2

Naturally, I looked up 197f1706f2aa45480c1debc40628c87823da08f6, which resolves to https://blockchain.info/address/13Kp9AJAxhEEjFo8N6YTP9DMW71YpK2fD9, but no funds there. Ok, that can happen if the bloom filter sees a false positive (allegedly 10^-27 probability), but a re-run in the same search space went smoothly with neither any fake zero-hash160 finds nor this false positive.
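For orientation, the usual textbook estimate of a Bloom filter's false-positive probability is p = (1 - e^(-k*n/m))^k for n entries, m bits and k hash functions. The sketch below evaluates it for about 9 million entries (the "9 M addresses" mentioned earlier in the thread); the filter size of 2^32 bits and k = 20 are placeholder values chosen for illustration, not confirmed LBC parameters.

```python
import math


def bloom_false_positive(n_items: int, m_bits: int, k_hashes: int) -> float:
    """Textbook estimate p = (1 - e^(-k*n/m))^k for a Bloom filter."""
    # -expm1(-x) equals 1 - e^(-x) but stays accurate for small x.
    fill = -math.expm1(-k_hashes * n_items / m_bits)  # expected share of set bits
    return fill ** k_hashes


if __name__ == "__main__":
    p = bloom_false_positive(n_items=9_000_000, m_bits=2**32, k_hashes=20)
    print(f"estimated false-positive probability per lookup: {p:.2e}")
```

With numbers this small, a single false positive in an overnight run is far more likely to come from corrupted data than from the filter itself, which fits the clean re-run.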

Investigating, but it seems like clEnqueueReadBuffer does not respect a blocking read after it has been called with non-blocking reads before.

I have done some more optimizations, but all I achieved was that the GPU load went down from 43% to 34%. I need to take load off the CPU!


Rico
legendary
Activity: 1120
Merit: 1037
฿ → ∞
Quote
Of course I also checked the client with all blocks containing private keys the pool has found so far - it reliably finds all of them.

Can you explain this further please? How do you know which blocks contains keys the pool has found?

https://lbc.cryptoguru.org/trophies

What I meant was that, in addition to the usual ./LBC -x, I also searched manually in spaces where the known private keys of the puzzle transaction are (all compressed) and also the two addresses we found with funds on them (which are uncompressed).

The new CPU/GPU hybrid found all of them, so I assume it is a working drop in replacement.
Testing the LBC is crucial, because when you have rare events like we have, you cannot afford to have a generator that overlooks something.
If your computer works for a month without a find, you have to be pretty sure it is because there really was nothing, and not because some bug made your client "oversee" something. So that's basically what my test (and the statement) was about.


Rico
full member
Activity: 149
Merit: 100
Quote
Of course I also checked the client with all blocks containing private keys the pool has found so far - it reliably finds all of them.

Can you explain this further please? How do you know which blocks contains keys the pool has found?

If someone found a valid key and just let it go and kept colliding, would you know about it?
legendary
Activity: 1948
Merit: 2097
I am performing some tests about endomorphism.

To recall the idea: we would like to generate:

a) 1G,  2G,  3G,  ......., kG, .......... , 2^160G
b) 1G',  2G', 3G', ......., kG', .........., 2^160G'    where G'=lambdaG
c) 1G'', 2G'', 3G'', ......., kG'',.........., 2^160G''   where G''=lambda^2G

We are sure that each row has different elements, because G, G', G'' have period n. But of course we cannot be sure that each element of b) for example is not an element of a) too. If we generated n keys instead of just 2^160, we would get the entire group of all n points, and then all the 3 rows would have the same elements. Only the order would be different.

But we have to generate only "few" elements.
Let's look at the rows a) and b) and at the relation between 2 corresponding elements: kG' = k*(lambdaG) = lambda*(kG). Where are these elements of b)?

My guess is:
multiplication by lambda spreads the 2^160 elements of b) evenly over the key space (keys taken with respect to the generator G).

If that were true, how often would we have a "collision" (double computation of the same key in 2 distinct rows) between the 2 rows?
If the keys of the b) row are actually evenly distributed, the probability for each new key of b) to fall in the range 1-2^160 should be 2^160/2^256, about 1/2^96. If we generated 2^160 elements, we'd have 2^64 collisions.

To test this hypothesis, I generated 2^30 keys of row b) (lambda*1, lambda*2, lambda*3, ..., lambda*2^30). None of these were in the range (1,2^160), so I checked how many were in larger ranges, for example (1,2^238), and in that case I got about 2^12 'collisions' (2^238/2^256 * 2^30 = 2^12). So these results seem to confirm my hypothesis.

In summary, since we only have to generate 2^160 keys, we can accept (but obviously it's up to you) a double computation for one key in every 2^96, i.e. only 16 'collisions' in the first 2^100 keys.
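
The estimates above all come from one ratio: the number of generated keys times the fraction of the key space covered by the target range. A minimal sketch of that arithmetic, assuming (as the post does) that the keys of row b) are uniformly distributed and that the group order n can be treated as 2^256; the function name `expected_hits` is only illustrative:

```python
from fractions import Fraction

# Assumption: the group order n is approximated by 2^256, as in the post.
KEYSPACE_BITS = 256


def expected_hits(generated_bits: int, range_bits: int) -> Fraction:
    """Expected number of row-b) keys that land in the range 1..2^range_bits
    when 2^generated_bits keys are spread uniformly over ~2^256 values."""
    per_key_probability = Fraction(2**range_bits, 2**KEYSPACE_BITS)
    return 2**generated_bits * per_key_probability


if __name__ == "__main__":
    # Full search: 2^160 keys against the range 1..2^160
    print(expected_hits(160, 160) == 2**64)
    # The 2^30-key experiment against the range 1..2^238
    print(expected_hits(30, 238) == 2**12)
    # Same experiment against 1..2^160: far below one expected hit
    print(expected_hits(30, 160))
    # First 2^100 keys against 1..2^160
    print(expected_hits(100, 160))
```

Exact fractions are used instead of floats so that the powers of two compare exactly. This only reproduces the expectation; it does not test whether multiplication by lambda really behaves uniformly.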

A question remains: do you want to generate random keys outside of your initial range? In case of a collision, how can somebody prove to you that it is his key, since that key is indistinguishable from the others?

If you instead want to drop the endomorphism, remember that your generation speed will be roughly halved (the cost goes from 1.1 M to 2.1 M for each point).
legendary
Activity: 1120
Merit: 1037
฿ → ∞
Can't wait to get home and try the GPU version.

Think March. I have some basic quality assurance in this project. ;)
The client basically works, but several things are still hard coded for my notebook (choice of OpenCL device).
I have no feedback (diagnostics-OpenCL.txt) from AMD GPUs yet.
Client is stable. Ran the whole night through on my notebook with 7.x Mkeys/s:



Of course I also checked the client with all blocks containing private keys the pool has found so far - it reliably finds all of them.

Quote
If I have 4 cores and 4 GPUs will it use a GPU with each core or...

Right now, one GPU would be taken as accelerator for all cores and still be bored.
Probably the best balance one can get right now is 1 GPU and many cores: an Amazon p2.xlarge, or something similar to my notebook. That's why I am asking for the OpenCL diagnostics files, to be able to cover a broader range of configurations.

My next step will be to incorporate arulbero's ECC magic to shift the balance by taking
more and more load off the CPU. Current status: https://twitter.com/LBC_collider

Quote
Also,  can you make it so I can run this on an Rpi?
Allow the client to run the old Go script; that should suffice.

It's unlikely I will go down that path for now.
The HRD client was originally about 13 times faster than the Go client (by now >15 times), and 32-bit architectures are on average half the speed of 64-bit ones.
I do have a 32-bit notebook (Lenovo Z61p) with two cores that does about 200 Kkeys/s on both cores with HRD; the same notebook does around 12 Kkeys/s with the Go client.
My new notebook was originally 14 times faster with CPU only and is by now over 35 times faster than HRD on the old one. It is about 616 times faster than the Go client on the old notebook.
Also, the Go client needed more memory (2GB).
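
The factors quoted above can be cross-checked against the rates mentioned in this thread. A small sketch, assuming ~2.8 Mkeys/s for the new notebook with CPU only and 7.30 Mkeys/s with the GPU client (both figures appear in other posts on this page); the constant and function names are only illustrative:

```python
# Rates in keys per second, as quoted in the thread (rounded figures).
OLD_GO = 12_000       # Go client on the old 32-bit notebook
OLD_HRD = 200_000     # HRD client on the same notebook, both cores
NEW_CPU = 2_800_000   # new notebook, CPU only (assumed from "~ 2.8 Mkeys/s")
NEW_GPU = 7_300_000   # new notebook with GPU (assumed from "7.30 Mkeys/s")


def speedup(fast: float, slow: float) -> float:
    """How many times faster 'fast' is than 'slow'."""
    return fast / slow


if __name__ == "__main__":
    print(f"HRD vs Go, old notebook:      {speedup(OLD_HRD, OLD_GO):.1f}x")
    print(f"new CPU-only vs old HRD:      {speedup(NEW_CPU, OLD_HRD):.1f}x")
    print(f"new GPU client vs old HRD:    {speedup(NEW_GPU, OLD_HRD):.1f}x")
    print(f"new GPU client vs old Go:     {speedup(NEW_GPU, OLD_GO):.0f}x")
```

With these rounded inputs the last factor comes out a little above 600; the quoted 616 would correspond to roughly 7.4 Mkeys/s, which is consistent with "7.x Mkeys/s".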

So my goal is to make a GPU client so that my current notebook (and your computer) will be x-thousand times faster than Go on the old/small machines.


Rico

legendary
Activity: 1140
Merit: 1000
The Real Jude Austin
6.2 s: CPU generates 16.7 M public keys (x,y)
1.8 s: GPU performs SHA256 / ripemd160 of (x,y) and (x) <-compressed,

Yes.

Quote
what do you mean "compressed key is done with GPU"?

Code:
sha256_in[0] = 0x02 | (sha256_in[64] & 0x01);

 ;)
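
One way to read the one-liner above: if the 65-byte hash input holds the uncompressed key as prefix byte, then x (32 bytes), then y (32 bytes), then `sha256_in[64]` is the last byte of y, and its lowest bit decides whether the compressed prefix is 0x02 (y even) or 0x03 (y odd). The buffer layout is an assumption, not something confirmed in the thread, and the helper name `compress_in_place` is hypothetical. A Python sketch of the same step:

```python
def compress_in_place(sha256_in: bytearray) -> bytes:
    """Turn an uncompressed public key buffer into its compressed form.

    Assumed layout (not confirmed by the thread):
      sha256_in[0]      prefix byte (0x04 for an uncompressed key)
      sha256_in[1:33]   x coordinate, big-endian
      sha256_in[33:65]  y coordinate, big-endian
    """
    if len(sha256_in) != 65:
        raise ValueError("expected a 65-byte uncompressed key buffer")
    # Same operation as the quoted line: the parity of y selects 0x02 or 0x03.
    sha256_in[0] = 0x02 | (sha256_in[64] & 0x01)
    # The compressed key is the new prefix followed by x only (33 bytes).
    return bytes(sha256_in[:33])


if __name__ == "__main__":
    # Synthetic coordinates, not a real curve point: only the byte handling is shown.
    x = bytes(range(32))
    y_odd = bytes(31) + b"\x07"
    buf = bytearray(b"\x04" + x + y_odd)
    print(compress_in_place(buf).hex())
```

Under this reading, no extra point arithmetic is needed for the compressed key: the x and y already computed for the uncompressed hash are reused, and only the first byte and the hashed length change.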


Quote
Anyway, at the moment the CPU is the bottleneck; the GPU does its work at least 3x faster than the CPU...

Sure. It is a 1st step. The big advantage is that it works like a drop-in replacement.
I see lots of optimization potential. Originally, my notebook maxed out at ~2.8 Mkeys/s, and now:

Code:
$ LBC -c 8
Ask for work... got blocks [383054009-383054392] (402 Mkeys)
oooooooooooooooooooooooo (7.30 Mkeys/s)
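
The numbers in this output fit together if one block covers 2^20 keys. That block size is an inference from "384 blocks, 402 Mkeys" and is not stated in the thread; the helper name `work_unit` is hypothetical. A quick sketch of the arithmetic:

```python
# Assumption: one block = 2^20 keys, inferred from 384 blocks ~ 402 Mkeys.
KEYS_PER_BLOCK = 2**20


def work_unit(first_block: int, last_block: int, keys_per_second: float):
    """Return (blocks, keys, seconds) for an inclusive block interval."""
    blocks = last_block - first_block + 1
    keys = blocks * KEYS_PER_BLOCK
    return blocks, keys, keys / keys_per_second


if __name__ == "__main__":
    blocks, keys, seconds = work_unit(383054009, 383054392, 7.30e6)
    print(blocks)                      # number of blocks in the interval
    print(f"{keys / 1e6:.1f} Mkeys")   # size of the work unit
    print(f"{seconds:.0f} s")          # time needed at 7.30 Mkeys/s
```

At the reported rate, a work unit of this size takes a little under a minute.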


Rico






edit:


LOL...

Code:
$ LBC -t 1 -l 0
Ask for work... Server doesn't like us. Answer: toofast.

Can't wait to get home and try the GPU version.

If I have 4 cores and 4 GPUs will it use a GPU with each core or...

Also,  can you make it so I can run this on an Rpi?

Allow the client to run the old Go script; that should suffice.