Author

Topic: How is done GPU hostility ? (Read 1844 times)

legendary
Activity: 1428
Merit: 1093
Core Armory Developer
October 28, 2011, 10:09:13 PM
#10
i think its memory(cache) intensive code, instead of computation intensive code.
you could do it on a GPU, but if the memory is not fast enough, or all the required memory can't be in cache GPUs will not speed it up.
For today's GPUs, decision-intensive code works well too. Mixing functions like:

k ^= (j&1) ? m : n;
m ^= (j&2) ? k ? n;
n ^= (j&4) ? m ? k;

I don't know about ATI, but in CUDA you want to avoid integer division and modulo operations like the plague.  A CUDA thread can do like 32 floating point operations per clock cycle, but it takes multiple clock-cycles to do one modulo operation.   Anything heavy on the modulus operations would certainly run poorly.
legendary
Activity: 1596
Merit: 1012
Democracy is vulnerable to a 51% attack.
October 27, 2011, 10:40:22 PM
#9
i think its memory(cache) intensive code, instead of computation intensive code.
you could do it on a GPU, but if the memory is not fast enough, or all the required memory can't be in cache GPUs will not speed it up.
For today's GPUs, decision-intensive code works well too. Mixing functions like:

k ^= (j&1) ? m : n;
m ^= (j&2) ? k ? n;
n ^= (j&4) ? m ? k;

donator
Activity: 1218
Merit: 1079
Gerald Davis
October 27, 2011, 10:23:51 PM
#8
I'm having a tough time believing that ATI cards are as different as you suggest.  NVIDIA GPUs tend to share 16-48 kB of shared memory and it takes 1-4 clock cycles.  There is nothing faster on the GPU than shared mem, except for the couple registers each thread has.  

Well that isn't exactly true.  Both NVidia and AMD (no ATI) have L1 cache inside each "group" of SP.  NVdia calls is SM, AMD a Compute Unit.  This cache is usually a 1 clock cycle latency and isn't shared with other "groups" of SP and is very fast.  The only problem is the amount is insufficient to hold scrypt lookup tables which is why NVidia & AMD cards both suck at running scrypt based hashing functions.  

http://www.microway.com/pdfs/GPGPU_Architecture_and_Performance_Comparison.pdf

However NVIDIA Tesla cards have 32KB of L1 cache per SM (32 SP) which may be "enough".  That makes them more like CPU in terms of L1 cache size.  It may not make Tesla economical (given their extreme price) but they likely achieve high level of efficiency than other GPU.

Quote
However, you do make a good point:  extremely random accesses to memory changes the story quite a bit.  Global memory running 100-200x slower than shared mem usually isn't a problem because you don't do it very often, and sequential accesses can get you a 32x boost.  However, if the computation required each thread to continually/randomly access global memory, it probably would take quite a hit in efficiency.

Exactly.  BOTH NVidia and ATI GPU lack sufficient L1 cache to load entire lookup table.  The scrypt algorithm is designed to require significant number of lookups, modifications, and writes.  This means almost continually the GPU encounters a cache miss, halts, and waits up to 4 clock cycles to pull it from shared memory.  While 4 clock cycles may not seem long time remember CPU is seeking the entire lookup table continually in L1 cache, and running at up to 4x as fast.  The scrypt algorithm is computationally simple making the penalty for a cache miss very high.  Essentially the architecture which can't ensure nearly 100% cache hit rate loses.
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
October 27, 2011, 10:08:35 PM
#7
I'm having a tough time believing that ATI cards are as different as you suggest.  NVIDIA GPUs tend to share 16-48 kB of shared memory and it takes 1-4 clock cycles.  There is nothing faster on the GPU than shared mem, except for the couple registers each thread has.   

However, you do make a good point:  extremely random accesses to memory changes the story quite a bit.  Global memory running 100-200x slower than shared mem usually isn't a problem because you don't do it very often, and sequential accesses can get you a 32x boost.  However, if the computation required each thread to continually/randomly access global memory, it probably would take quite a hit in efficiency.
donator
Activity: 1218
Merit: 1079
Gerald Davis
October 27, 2011, 09:59:27 PM
#6
Shared memory is too slow to achieve the throughput necessary. It doesn't require MB of memory just KB of ultra low latency cache.  

Scrypt is an example.  It will run just fine on a CPU with only couple MB of main memory available but dies on a GPU with 2, 4, 6GB of memory.

The latency to shared memory is simply too slow.  The number of lookups for a single hash is very large.  Scrypt requires very little computation and a lot of psuedo random lookups.


GPU don't have L2/L3 cache but the do have a tiny (far tinier than CPU) ultra fast L1 cache shared between a group of SP.  It is insufficient to keep scrypt lookup tables in cache.  Thus hundreds of times per hash a cache miss occurs, the GPU stalls waiting for data to be pulled from shared memory.  A CPU can keep entire lookup table in cache and thus encounters no significant latency in one hashing operation.  The GPU simply doesn't have a chance.  
legendary
Activity: 1428
Merit: 1093
Core Armory Developer
October 27, 2011, 09:51:24 PM
#5
I'm not much of an ATI/OpenCL programmer, but I've done quite a bit of CUDA programming (see my CUDA fractal viewer), and there's a lot of similarities between the way these chips store and access memory.  

I'm not too familiar with the L1-L3 cache on GPUs -- they tend to operate with some kind of global memory (the number you see on the graphics card box), and then various banks of shared memory and graphics caches.  On most CUDA hardware, there is 16-48kB of shared memory split between 256 threads.  It's fast as hell, but that's not a lot of memory.  Since a lot of programs would need more shared memory than that, they have to use the global memory anyway:  and they still do just fine getting 5-100x speed up over CPUs.    On the other hand, requiring the GPU to store memory in your RAM, would be disastrous.  It takes like 1000x more time to access RAM than it does its own global memory (don't quote me on that number).

Therefore, in order to disarm GPUs of their dominance, I believe you really need to force them to exceed their global memory capacity.  Here's what you'll find in a high-end machine with a good CPU and a solid GPU (take 6970 for exactly)

CPU:  4 cores, 8GB RAM  (2 GB/core)
GPU:  1600 cores, 2 GB RAM (1-2 MB/core)

The issue with hashing is that it requires only a couple kB of RAM, so there's nothing stopping the GPU from hitting full occupancy (maximum parallelization).  However, if the hashing were replaced by an operation that required 100 MB RAM per thread, the GPU would only be able to use 20/1600 cores.  It'd probably be no better than the CPU.  There's ways to do this:  you could replace sha256(sha256(a)) with having to sequentially hash the string 3,000,000 times, and concatenate all the hashes into a single string (requiring 100 MB to store), then execute some operation on that string that requires the thread to store all of it in memory at once (hashing can be done in pieces, but there are other options that would require it).  

Each GPU thread would have to use global memory, and only 20 would fit.  You might as well use your CPU (which whose four cores are probably faster than the GPU's 20).  100 MB is good, because a lot of computers can spare 400 MB of RAM without crippling the computer.

donator
Activity: 1218
Merit: 1079
Gerald Davis
October 27, 2011, 03:58:52 PM
#4
GPU have very little ultra-fast cache.  For example AMD GPU share only 8K of cache between an entire Compute Unit (14 Stream Processors).

Any algorithm which requires lots of random reads larger than the cache size of processors will require the processor to idle while data is pulled from L2/L3 cache or main memory.  This is punitively slow.  The processor (GPU or CPU) will simply sit their doing absolutely nothing while it waits.

Scrypt is one example.  However there is nothing magical about it.  It looks like it requires ~32KB to function optimally.  It is entirely possible that some future GPU will have 32KB or more of L1 cache.  At which point it would utter anhilate the performance of current CPUs.

So instead of saying GPU-hostile it would be more accurate to say large cache dependent.  If future GPU have large cache it won't be hostile at all.  Likewise if a CPU has insufficient cache (like new Bulldozer is pretty horrible at scrypt based hashing despite not being a GPU).
sr. member
Activity: 270
Merit: 250
October 27, 2011, 12:47:44 PM
#3
litecoin and tenebrix both use the scrypt algorithm that sidelines gpu's.
legendary
Activity: 1050
Merit: 1000
You are WRONG!
October 27, 2011, 06:05:20 AM
#2
i think its memory(cache) intensive code, instead of computation intensive code.
you could do it on a GPU, but if the memory is not fast enough, or all the required memory can't be in cache GPUs will not speed it up.
member
Activity: 138
Merit: 25
October 27, 2011, 05:58:36 AM
#1
Which algorithm must be that GPU can't solve it in parallel way?
Jump to: