So basically you expect other devs to work on the open-source code so your private miner gets faster.
Pallas says thank you to you.
I will study your new optimized Neoscrypt code and learn.
You should study my code as well. We take different routes to the same goal. I do small compiler/assembly optimizations while you are reinventing the algorithm. You should buy my private miner and analyze it. I will give you a discount of 0.05 BTC.
I'm aware of a couple of assembly-level optimization techniques I've used before, but they require an intimate knowledge
of the CPU (GPU in this case) architecture, including the memory interface, cache organization and execution environment.
I intend to give this a try with cpuminer-opt once I get up to speed on the Intel architecture. Would you be interested in
doing it for CUDA? I could explain the details.
yeah because it is well known that we just do random stuff
LOL. Trial and error also works sometimes, the key is to figure out exactly why it works.
In my teaser it's about code scheduling to maximize throughput by avoiding processor stalls
and maximizing superscalar operation. It's not something that compilers can do easily
because it requires so much analysis and detailed understanding of the operation of
a CPU.
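To make that concrete, here's a rough sketch of what I mean (the 4-way split and the names are purely illustrative, nothing to do with any miner code): a single dependency chain leaves the extra ALUs idle, while independent chains let them run in parallel.

#include <stddef.h>
#include <stdint.h>

/* Serial version: every add depends on the previous result, so a superscalar
   core's extra execution units sit idle waiting on one dependency chain. */
uint32_t sum_serial( const uint32_t *v, size_t n )
{
   uint32_t s = 0;
   for ( size_t i = 0; i < n; i++ )
      s += v[i];
   return s;
}

/* Scheduled version: four independent accumulators can issue in parallel and
   hide the latency of each add. The right width depends on the issue width
   and the latency of the operation being chained. */
uint32_t sum_scheduled( const uint32_t *v, size_t n )
{
   uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
   size_t i = 0;
   for ( ; i + 4 <= n; i += 4 )
   {
      s0 += v[i];
      s1 += v[i+1];
      s2 += v[i+2];
      s3 += v[i+3];
   }
   for ( ; i < n; i++ )       /* leftover elements */
      s0 += v[i];
   return s0 + s1 + s2 + s3;
}

For a trivial reduction like this a compiler can often unroll it on its own; it's the long dependency chains inside hash rounds, where the order of operations is fixed by the algorithm, that need the hand analysis and careful interleaving.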
You may think I'm joking or boasting but I'm willing to discuss it openly and be subject to humiliation.
You may also wonder why I would do this. If the techniques work, I would ask the developers to
open some of their private code. Also I want to leverage the CUDA expertise available to improve
the product for everyone.
I'll post one technique in a while that requires CPU support, but I have to dig out my old processor manuals
for a refresher. This one is something compilers should do since it is documented.
Edit: I've reviewed my manuals and checked out the Haswell optimization manual online, and didn't see
anything at first glance that indicates they have support to do the following:
allocate load
A special form of the load instruction will cause a line of cache to be allocated without accessing memory to fill it.
This is useful when allocating memory and you don't care what data is in it. Memory isn't accessed unless the cache line
gets flushed for other reasons. And if the buffer is used only for a short time, it may never need to touch memory at all.
Example:
uint32_t* p = allocate_load( size );    // hypothetical: allocate the cache lines, no memory read to fill them
// crunch some data in p
free_and_invalidate_cache( &p );        // hypothetical: discard the lines, no writeback
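For comparison, the closest real instruction I know of is PowerPC's dcbz (data cache block zero), which establishes the cache line containing an address and zeroes it without fetching its old contents from memory. Here's a minimal sketch of the same buffer lifetime built on it; the 128-byte line size is an assumption (older cores use 32 or 64), and the final invalidate step stays a comment because I don't know of a user-level discard-without-writeback on current CPUs:

#include <stdint.h>
#include <stdlib.h>

#define LINE_SIZE 128   /* assumed cache line size: 32, 64 or 128 depending on the core */

/* PowerPC only: dcbz establishes the cache line holding p and zeroes it
   without reading the line's old contents from memory. */
static inline void cacheline_zero_alloc( void *p )
{
#if defined(__powerpc__) || defined(__powerpc64__)
   __asm__ volatile ( "dcbz 0,%0" : : "r" (p) : "memory" );
#else
   (void) p;   /* no user-level equivalent on x86: plain stores still fetch the line (RFO) first */
#endif
}

int main(void)
{
   size_t size = 4 * LINE_SIZE;
   uint32_t *p = aligned_alloc( LINE_SIZE, size );
   if ( !p ) return 1;

   /* "allocate load": bring the buffer's lines into cache zeroed,
      without reading whatever stale data memory holds. */
   for ( size_t off = 0; off < size; off += LINE_SIZE )
      cacheline_zero_alloc( (char*)p + off );

   /* crunch some data in p ... */

   /* "free_and_invalidate_cache": no portable user-level way to drop the
      dirty lines without writeback, so that part remains wishful thinking. */
   free( p );
   return 0;
}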
Even better: lock the cache line to guarantee it never gets flushed, effectively expanding the
register set.
struct my_struct *my_regs = allocate_load_and_lock_line( size );   // hypothetical
// do stuff with my_regs->r1 etc.
free_unlock_and_invalidate_cache( &my_regs );
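As far as I know mainstream x86 doesn't expose user-level cache line locking (some embedded PowerPC and ARM parts do), so the closest you can get today is keeping one line-aligned block of working state hot in the inner loop and counting on the LRU to keep it resident. Purely an illustrative sketch of the usage pattern, with made-up state and a 64-byte line assumed:

#include <stddef.h>
#include <stdint.h>

/* One cache line of working state. _Alignas(64) assumes a 64-byte line.
   Without a real lock instruction it only stays resident because the hot
   loop touches it constantly (and state this small may simply end up in
   real registers anyway). */
struct my_regs
{
   _Alignas(64) uint32_t r[16];   /* 16 x 4 bytes = exactly one 64-byte line */
};

uint32_t hot_loop( const uint32_t *data, size_t n )
{
   struct my_regs regs = { {0} };
   for ( size_t i = 0; i < n; i++ )
   {
      regs.r[ i & 15 ]     += data[i];            /* spread work across the line */
      regs.r[ (i+1) & 15 ] ^= regs.r[ i & 15 ];
   }
   uint32_t out = 0;
   for ( int j = 0; j < 16; j++ )
      out ^= regs.r[j];
   return out;
}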
What do you think?