I was trying to find out if our skills were complementary. I'm a complete noob when it comes
to CUDA, so I was hoping SP could implement some of my ideas with his knowledge of it.
When I provided a demonstration of my skills he responded with "silly you, that was CPU verification
code, why don't you do better", without ever considering the technical merit or other
applications of the changes I made.
He's more interested in selling what he has over and over again rather than providing anything new
that sells itself. I'm afraid SP has turned into a telemarketer.
https://github.com/NervanaSystems/maxas
I'd like to learn the processor architecture in detail so I can determine things like how many loads to queue up to
fill the pipe, how many execution units there are, user cache management, etc. That kind of information
is necessary to maximize instruction throughput at the processor level. Do you know of any available
docs with this kind of info?
There is not much info available, but if you disassemble compiled code you will see that Maxwell is superscalar with 2 pipes: 2 instructions per cycle. It's able to execute instructions while writing to memory if the code is in the instruction cache. And to avoid ALU stalls you need to reorder your instructions carefully. There are vector instructions that can write bigger chunks of memory with fewer instructions... etc etc. The compiler is usually doing a good job here. Little to gain. Ask DJM34 for more info. He is good at the random stuff...
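To make the vector-instruction point concrete, here is a minimal CUDA sketch (my own illustration, not code from the thread; the kernel and names are made up): a float4 load/store compiles to one 128-bit memory instruction instead of four 32-bit ones, cutting the instruction count on the memory pipe.

// Minimal sketch: vectorized access moves bigger chunks per instruction.
// Both pointers must be 16-byte aligned; n4 counts float4 elements.
#include <cuda_runtime.h>

__global__ void scale_vec4(const float4 *in, float4 *out, float k, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];          // one 128-bit load
        v.x *= k; v.y *= k;        // ALU work the scheduler can overlap
        v.z *= k; v.w *= k;        // with other warps' memory traffic
        out[i] = v;                // one 128-bit store
    }
}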
Thanks again.
Have you tried interleaving memory accesses with arithmetic instructions so they can be issued on the same clock?
When copying memory, do you issue the first load and then the first store immediately after it? The first load fills the cache
line and the first store waits for the first bytes to become available. Then you can queue up enough loads to fill
the pipe and do other things while waiting for memory. Multi-buffering is a given, being careful not to overuse regs.
If you're doing a load, process, and store it's even better, because you can have one instruction slot focused on memory
while the other does the processing.
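A rough CUDA sketch of that load/process/store overlap (my illustration; buffer names and the trivial transform are made up): the load for iteration i+1 is issued before iteration i is processed and stored, so it is in flight while the ALU work happens.

// Two register buffers: next load in flight while current is processed.
#include <cuda_runtime.h>

__global__ void process_pipelined(const int *in, int *out, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    if (i >= n) return;

    int cur = in[i];                       // first load: start filling the pipe
    for (int j = i; j < n; j += stride) {
        int nxt = 0;
        if (j + stride < n)
            nxt = in[j + stride];          // issue the next load early...
        out[j] = cur * 2 + 1;              // ...then process and store current
        cur = nxt;                         // while the next load is in flight
    }
}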
These are things I'd like to try but haven't got the time. Although I've done similar in the past, there were no performance
tests that could quantify the effect, good or bad.
If you think this has merit give it a shot. Like I said if it works just keep it open because I could still implement it myself.
The hotter the code segments you choose the bigger the result should be. Some of the assembly routines would be logical
targets.
GDS (global memory), LDS (local memory), and work-item shuffle all require a little waiting period before they complete. So, say I'm using ds_swizzle_b32 (work-item shuffle) like I had fun with in my 4-way Echo-512... On AMD GCN, you can do some shit like so:
# This is done in place of BigShiftRows, but before BigMixColumns.
# So, my uint4 variables (in OpenCL notation) named b and c are now loaded properly without the need for shifting rows.
ds_swizzle_b32 v36, v80 offset:0x8039 # b.z
ds_swizzle_b32 v37, v81 offset:0x8039 # b.w
ds_swizzle_b32 v38, v78 offset:0x8039 # b.x
ds_swizzle_b32 v39, v79 offset:0x8039 # b.y
ds_swizzle_b32 v15, v84 offset:0x804E # c.z
ds_swizzle_b32 v16, v85 offset:0x804E # c.w
ds_swizzle_b32 v33, v82 offset:0x804E # c.x
ds_swizzle_b32 v34, v83 offset:0x804E # c.y
# Each and every one of these takes time, however - and each one increments a little counter.
# What I can do is this - since the first row in the state is not shifted, the a variable is already ready
# It's in registers and ready to be used.
# The first thing I do in the OpenCL after loading up the proper state values - in BigMixColumns - is a ^ b.
# So, I can do something like this:
s_waitcnt lgkmcnt(4)
# What this does is, it waits on the pending operations until there are four left.
# They're queued in the order the instructions were issued - so the b uint4 should now be loaded
# Note, however, that the c uint4 is NOT guaranteed to have been loaded, and cannot be relied on (yet.)
# Now, I can process the XOR while the swizzle operation on the c uint4 is working!
v_xor_b32 v42, v15, v36 # v42 = a.z ^ b.z
v_xor_b32 v43, v16, v37 # v43 = a.w ^ b.w
v_xor_b32 v38, v74, v38 # v38 = a.x ^ b.x
v_xor_b32 v39, v75, v39 # v39 = a.y ^ b.y
# And then we can put in an instruction to wait for the c uint4 before we continue...
s_waitcnt lgkmcnt(0)
In case you're wondering, I load the d uint4 later in the code. Also, if you *really* wanna try your damnedest to maximize the time spent executing compute shit during loads, you could do this (although you've probably figured it out by now):
ds_swizzle_b32 v36, v80 offset:0x8039 # b.z (needed so 8 ops are outstanding and lgkmcnt(7) below retires it)
ds_swizzle_b32 v37, v81 offset:0x8039 # b.w
ds_swizzle_b32 v38, v78 offset:0x8039 # b.x
ds_swizzle_b32 v39, v79 offset:0x8039 # b.y
ds_swizzle_b32 v15, v84 offset:0x804E # c.z
ds_swizzle_b32 v16, v85 offset:0x804E # c.w
ds_swizzle_b32 v33, v82 offset:0x804E # c.x
ds_swizzle_b32 v34, v83 offset:0x804E # c.y
s_waitcnt lgkmcnt(7)
v_xor_b32 v42, v15, v36 # v42 = a.z ^ b.z
s_waitcnt lgkmcnt(6)
v_xor_b32 v43, v16, v37 # v43 = a.w ^ b.w
s_waitcnt lgkmcnt(5)
v_xor_b32 v38, v74, v38 # v38 = a.x ^ b.x
s_waitcnt lgkmcnt(4)
v_xor_b32 v39, v75, v39 # v39 = a.y ^ b.y
# You get the idea...
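[Aside: CUDA eventually exposed a counted wait with the same shape as lgkmcnt, via the sm_80+ pipeline primitives. A hedged sketch of the analogue, with made-up names (staged_xor, gsrc, gdst) and placeholder work; on pre-Ampere parts the copies fall back to synchronous:]

// __pipeline_wait_prior(N) blocks until all but the N most recently
// committed async copies are done - same idea as s_waitcnt lgkmcnt(N).
#include <cuda_pipeline.h>

__global__ void staged_xor(const unsigned *gsrc, unsigned *gdst)
{
    __shared__ unsigned b[4], c[4];

    if (threadIdx.x == 0) {
        for (int i = 0; i < 4; ++i) {      // queue the b words...
            __pipeline_memcpy_async(&b[i], &gsrc[i], sizeof(unsigned));
            __pipeline_commit();
        }
        for (int i = 0; i < 4; ++i) {      // ...then the c words
            __pipeline_memcpy_async(&c[i], &gsrc[4 + i], sizeof(unsigned));
            __pipeline_commit();
        }
        __pipeline_wait_prior(4);          // b done; c may still be in flight
        for (int i = 0; i < 4; ++i)
            gdst[i] = b[i] ^ 0xdeadbeefu;  // placeholder work overlapping c
        __pipeline_wait_prior(0);          // now c is guaranteed complete
        for (int i = 0; i < 4; ++i)
            gdst[4 + i] = c[i] ^ b[i];
    }
}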
I think I follow, even though that swizzle syntax is completely foreign to me. I think what you did is what
I was talking about. But I would go one step farther. It may not apply, because I don't understand
the wait instructions, unless there are synchronization issues.
In addition to what you did, I would put the first xor on b immediately after the first load. I know
it's stalled waiting for data, but I want its dependent instruction already queued for when the data
becomes available.
Secondly, that first load will fill the cache line, so there is no need to queue up more loads until
the first load completes. Subsequent loads will finish immediately because they hit the cache.
What I would not do is have a string of identical instructions, because they all compete for the
same execution unit and can only be issued one per clock. I would interleave the swizzles and xors
so they can both be issued on the same clock, assuming all dependencies are met.
With comments:
ds_swizzle_b32 v36, v80 offset:0x8039 # b.z // start filling the cache with b
v_xor_b32 v42, v15, v36 # v42 = a.z ^ b.z // queue first xor for when b is ready
ds_swizzle_b32 v37, v81 offset:0x8039 # b.w // this will complete one clock after the previous swizzle so...
v_xor_b32 v43, v16, v37 # v43 = a.w ^ b.w // make sure we're ready for it
I think you get it. When all the b vars are loaded you can queue the c vars while still processing and
saving the first batch.
I would even go one step farther, to the loading of a, if possible.
I would start with the swizzle of a, immediately followed by the swizzle of b, then the first xor.
There will be a lot of stalling waiting for memory here, so if there are any other trivial tasks,
do them next.
Loading a & b in parallel may seem odd, but once both are in the cache you're flying. Then you can
mix saving processed data and loading new data, giving priority to loads to keep the GPU hot, and
you can stick in the first swizzle of c early to get the data ready.
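One way to picture that flow in CUDA terms is the classic double-buffered loop (a sketch under assumed names; launch with blockDim.x == TILE and in/out sized ntiles * TILE, and the "processing" is a placeholder):

// Tile k+1 is being fetched into one shared buffer while tile k is
// processed out of the other; __syncthreads keeps the buffers coherent.
#include <cuda_runtime.h>

#define TILE 256

__global__ void double_buffered(const float *in, float *out, int ntiles)
{
    __shared__ float buf[2][TILE];
    int t = threadIdx.x;

    buf[0][t] = in[t];                     // prime the first buffer
    __syncthreads();

    for (int k = 0; k < ntiles; ++k) {
        int cur = k & 1, nxt = cur ^ 1;
        if (k + 1 < ntiles)                // start loading the next tile
            buf[nxt][t] = in[(k + 1) * TILE + t];
        out[k * TILE + t] = buf[cur][t] * 0.5f;   // process current tile
        __syncthreads();                   // next tile visible before reuse
    }
}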
I learned some of this stuff on a company-paid Motorola course. The instructor was a geek and our class
was pretty sharp, so we covered the material early and then started having fun. At the time we were in a
performance crunch, with customers demanding more capacity, so we focused on code scheduling and user
cache management. One of the more bizarre instructions was the delayed branch. It essentially means
branch AFTER the next instruction. That next instruction was often returning the rc. It took some getting
used to, but it gives an idea of the level of optimization they were into at the time.
It's the same CPU that had the ability to mark a cache line valid without touching mem. It's great for
malloc, because the data is initially undefined anyway. Who cares whether the garbage comes from
mem or stale cache, it's all garbage. Imagine mallocing 1k and having it cached without ever touching
the bus. They also had an instruction to preload the cache for real; that is essentially what I was
simulating above. It also had a user flush, so you could flush data at any convenient time after you
no longer needed it, instead of a system-initiated flush when you're stalled waiting for new data.