Intel 50-core Knights Corner | Bitcointalksearch.org

ArsenShnurkov

legendary

Activity: 1386

Merit: 1000

Renamed to
Xeon Phi
HPC Coprocessor

Intel’s James Reinders wrote that the Knights Corner open source software stack consists of an embedded Linux, a minimally modified GCC, and driver software (there’s also a package for GDB available separately). Reinders writes that the software stack – officially known as the Intel® Many Integrated Core (MIC) Platform Software Stack (MPSS) – depends on the 2.6.34 Linux kernel and has been tested with specific versions of 64-bit Red Hat Enterprise Linux 6.0, 6.1, and 6.2, as well as SuSE Linux Enterprise Server (SLES) 11 SP1.

http://software.intel.com/en-us/blogs/2012/06/05/knights-corner-open-source-software-stack/

AzN1337c0d3r

full member

Activity: 238

Merit: 100

★YoBit.Net★ 350+ Coins Exchange & Dice

Quote from: gat3way on June 12, 2012, 04:24:23 PM

Did increasing the thread count beyond the number of CPUs improve performance because you somehow "utilized" the cores better? Or did just the opposite happen because all you did was introduce scheduling contention?

When the thread count was somewhat greater than the core count, I utilized the cores better.

Quote

You can decouple the fetching and decoding from the execution. Instructions do not execute until they are ready.

Quote

Again, pipelines can't help much in the situation where an instruction depends on the result of a previous one. SHA256 has 64 steps, and each step depends on the result of the previous one. Now there are a number of independent instructions within each step, but that number is not unlimited.

1. The number of independent instructions is much much greater than the number of steps. So as far as dependencies are concerned, it is trivial.

2. A multithreaded processor can even execute instructions from another SHA hash calculation, which WILL be completely independent, if dependencies were an issue.

Quote

This is a disadvantage of VLIW4/5 and has nothing to do with GPGPU.

Inadequate register/shared memory is a disadvantage of any GPU, not only VLIW ones. That makes them much less suitable for memory-intensive algorithms, even if they are embarrassingly parallel. Moreover, resource-limited occupancy is a general GPGPU problem, far from being bitcoin-related or VLIW-related. It is a problem even for ALU-bound kernels like the bitcoin one.

I quoted the section on ALUPacking, not GPR usage.

Quote from: AzN1337c0d3r

SHA256 hashing only needs 1 thing: lots and lots of ALUs. Stuff like AVX, threads, and x86 just introduce structural dependencies that slow down the process.

Note that in my original post I never once said that bitcoin mining just requires tons and tons of ALUs on the GPU. You assumed that because you work extensively with GPUs.

Quote

Of course you would increase the GPR if you increase the number of ALU per CU. But note that a CU generates a structural dependency, a dependency that we created in order to accommodate GPGPU.

What makes you think you would increase the register count if you increase the number of ALUs? I see... mmmm... no relation between the two.

You just said that bitcoin mining is limited by the number of GPRs... there aren't enough of them per ALU.

Quote

If you were to make an ASIC miner, you sure as hell don't need a crapton of GPRs or CUs or "wavefront"... all you would need is tons and tons of ALUs.

Quote

Care to elaborate what "ALU" means in terms of ASIC?

Units that do AND, OR, XOR, NOT, bit shifts, and maybe integer arithmetic should be sufficient.
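To make that operation list concrete, here is a minimal Python sketch of the per-round SHA-256 helper functions as I understand them from the SHA-256 specification. The function names are my own and nothing here is taken from an actual miner or ASIC design. It only illustrates that each helper reduces to AND, XOR, NOT and rotates, with 32-bit addition used to combine the results.

```python
MASK32 = 0xFFFFFFFF  # SHA-256 operates on 32-bit words


def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits (two shifts and an OR)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def ch(x: int, y: int, z: int) -> int:
    """Choose: each bit of x selects the matching bit of y (1) or z (0)."""
    return ((x & y) ^ (~x & z)) & MASK32


def maj(x: int, y: int, z: int) -> int:
    """Majority: each output bit is the majority vote of the three inputs."""
    return (x & y) ^ (x & z) ^ (y & z)


def big_sigma0(x: int) -> int:
    """Sigma0 mixing function: XOR of three rotations."""
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)


def big_sigma1(x: int) -> int:
    """Sigma1 mixing function: XOR of three rotations."""
    return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25)


def add32(*words: int) -> int:
    """Addition modulo 2**32, the only integer arithmetic a round needs."""
    return sum(words) & MASK32
```

Nothing in these helpers needs floating point, multiplication, or branching, which is roughly the point being argued about what an "ALU" has to provide.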

gat3way

sr. member

Activity: 256

Merit: 250

Quote

Well this is quite dishonest, you're claiming one scrypt operation can keep each KC core utilized? I doubt it. It will probably require more than 4 scrypt operations per KC core to keep the hardware utilized, so the memory shoots up to 12.8 GB... also out of practical limits for an add-on board.

It can keep it. In the CPU world you don't hide latencies by scheduling other threads on a core when a memory-bound thread is stalled on a memory access. Quite the opposite: context switches are expensive (hyperthreading being a special exception, but there you have two register sets per core and things are different). Since you are "1337 c0d3r" I assume you've written a compute-intensive multithreaded application at some point. Did increasing the thread count beyond the number of CPUs improve performance because you somehow "utilized" the cores better? Or did just the opposite happen because all you did was introduce scheduling contention?

Quote

You can decouple the fetching and decoding from the execution. Instructions do not execute until they are ready.

Again, pipelines can't help much in the situation where an instruction depends on the result of a previous one. SHA256 has 64 steps, and each step depends on the result of the previous one. Now there are a number of independent instructions within each step, but that number is not unlimited.

Quote

This is a disadvantage of VLIW4/5 and has nothing to do with GPGPU.

Inadequate register/shared memory is a disadvantage of any GPU, not only VLIW ones. That makes them much less suitable for memory-intensive algorithms, even if they are embarrassingly parallel. Moreover, resource-limited occupancy is a general GPGPU problem, far from being bitcoin-related or VLIW-related. It is a problem even for ALU-bound kernels like the bitcoin one.

Quote

Of course you would increase the GPR if you increase the number of ALU per CU. But note that a CU generates a structural dependency, a dependency that we created in order to accommodate GPGPU.

What makes you think you would increase the register count if you increase the number of ALUs? I see... mmmm... no relation between the two.

Quote

If you were to make an ASIC miner, you sure as hell don't need a crapton of GPRs or CUs or "wavefront"... all you would need is tons and tons of ALUs.

Care to elaborate what "ALU" means in terms of ASIC?

AzN1337c0d3r

full member

Activity: 238

Merit: 100

Quote

bitcoin mining by definition has nothing to do with parallel computing. You could do it on a slow single-core MIPS processor for example, of course it would not be cost-effective.

Now you're just being pedantic. Modern GPU bitcoin mining kernels are written to take advantage of SIMD hardware by doing exactly the parallelization like I described earlier.

Quote

Provided that there is a non-dependent instruction to fetch/decode/execute.

You can decouple the fetching and decoding from the execution. Instructions do not execute until they are ready.

Quote

Blah blah blah need more GPRs per CU

If you were to make an ASIC miner, you sure as hell don't need a crapton of GPRs or CUs or "wavefront"... all this is needed only to support bitcoin mining under a GPGPU model.

All you would need is tons and tons of ALUs for an ASIC.

gat3way

sr. member

Activity: 256

Merit: 250

Quote

Wouldn't the exact same restrictions on GPUs apply to KC then? In fact they would pretty much apply to all large parallel architectures.

A modern-day GPU like the 7970 has 2048 SPs, and since you need at least 4 wavefronts/CU to keep the hardware utilized, you should schedule the kernel with an NDRange of at least 8192. At 16 MB per scrypt operation that makes 8192*16 MB = 131 GB of video memory, which is far beyond what the 7970 actually has. KC has 50 cores; assuming you use SSE2 registers (because AVX does not have integer arithmetic on ymms, that only comes with AVX2), you end up with the equivalent of 200 GPU workitems, or 200*16 MB = 3200 MB of memory, quite within practical limits.
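The arithmetic behind those two figures can be checked with a short Python sketch. All inputs are the numbers quoted in this thread (16 MB per scrypt operation, 2048 SPs, 64-workitem wavefronts, 50 cores, four 32-bit lanes per SSE2 register); the variable and function names are hypothetical and nothing here is measured.

```python
SCRYPT_MB_PER_OP = 16      # per-operation memory figure quoted in this thread
WAVEFRONT_SIZE = 64        # workitems per wavefront, as used in this thread


def scrypt_memory_mb(concurrent_ops: int, mb_per_op: int = SCRYPT_MB_PER_OP) -> int:
    """Memory needed to keep `concurrent_ops` scrypt operations in flight."""
    return concurrent_ops * mb_per_op


# GPU side: 2048 SPs grouped into CUs of 64 lanes, 4 wavefronts per CU.
compute_units = 2048 // WAVEFRONT_SIZE
gpu_workitems = compute_units * 4 * WAVEFRONT_SIZE

# Knights Corner side: 50 cores, each treated as four 32-bit SSE2 lanes.
kc_workitems = 50 * 4

gpu_mb = scrypt_memory_mb(gpu_workitems)
kc_mb = scrypt_memory_mb(kc_workitems)

print(f"GPU: {gpu_workitems} workitems -> {gpu_mb} MB (~{gpu_mb / 1000:.0f} GB)")
print(f"KC:  {kc_workitems} workitems -> {kc_mb} MB")
```

The gap comes entirely from how many operations each architecture needs in flight, not from the per-operation cost, which is the same 16 MB in both cases.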

Quote

Just what do you think bitcoin mining is?

bitcoin mining by definition has nothing to do with parallel computing. You could do it on a slow single-core MIPS processor for example, of course it would not be cost-effective.

Quote

You clearly do not understand how computer architecture or dependency resolution works. Waiting on the results of a previous instruction? Just start work on a non-dependent instruction.

Provided that there is a non-dependent instruction to fetch/decode/execute.

Quote

The current bitcoin mining hardware is restricted by the number of ALUs. True there might be a point where you need to increase the registers, decoders, scheduling to feed all those ALUs, but that's not until you increase the ALU resources drastically from where they are now.

Of course it is limited by the number of ALUs, but that's not the only limitation. It is also limited by occupancy, and occupancy is limited by GPR usage. Even a well-optimized kernel uses enough GPRs to limit the number of wavefronts/CU. That's why, for example, 2-component vectors worked best on VLIW hardware. uint2 does not provide enough independent instructions to fill the VLIW4/VLIW5 bundles, so ALUPacking was far from 100%. On the other hand, going to, say, uint4 improves ALUPacking but ironically worsens performance, because it requires more GPRs, so fewer wavefronts can be scheduled and occupancy drops, underutilizing the hardware. AMD could make their hardware much better suited to bitcoin mining (and not only that) if they increased their GPUs' register file, but they decided the current size would be enough. Generally yes, adding more ALUs would make the hardware faster, but having more GPRs per CU would definitely make it faster too. There isn't much use in more ALUs if you can't keep them busy. Right now, bitcoin kernels are a compromise and the hardware is never completely utilized.
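The GPR/occupancy trade-off can be sketched with a toy model in Python. Every number below is a made-up illustration: the register budget of 256 per lane, the cap of 8 wavefronts, and the 40 and 80 GPR kernel footprints are assumptions chosen to show the shape of the effect, not figures for any real GPU or kernel.

```python
def wavefronts_per_simd(gprs_per_workitem: int,
                        gpr_budget_per_lane: int = 256,
                        max_wavefronts: int = 8) -> int:
    """Resident wavefronts when registers are the only limiting resource.

    gpr_budget_per_lane and max_wavefronts are hypothetical hardware limits.
    """
    if gprs_per_workitem <= 0:
        raise ValueError("a kernel must use at least one register")
    return min(max_wavefronts, gpr_budget_per_lane // gprs_per_workitem)


# Hypothetical kernels: the wider vector type doubles the register footprint.
uint2_waves = wavefronts_per_simd(40)
uint4_waves = wavefronts_per_simd(80)

print(f"uint2 kernel: {uint2_waves} wavefronts resident")
print(f"uint4 kernel: {uint4_waves} wavefronts resident")
```

In this toy model the wider kernel halves the number of resident wavefronts, so better bundle packing is paid for with less latency hiding.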

An even better example is NVidia's Kepler. The 680GTX has 1536 ALUs, three times as many as a fast Fermi GPU like the 580GTX. Yet practical results show the 580 is faster than the 680 at bitcoin mining (and any other ALU-intensive GPGPU work for that matter). The reason? They went from grouping 32 cores in a CU to 192 cores, but instead of increasing the register file 6 times they increased it just 3 times, so you end up with half as many registers per core as you had with Fermi. The result is that you can't get proper occupancy and, alas, the 3x increase in ALUs is practically money for nothing. Kepler is not a GPGPU arch and GK110 unfortunately is not diverging from that.

AzN1337c0d3r

full member

Activity: 238

Merit: 100

Quote from: gat3way on June 12, 2012, 09:46:22 AM

scrypt is a sequential memory-hard algorithm and depends (mostly) on memory access speed. Since the memory requirements are enormous (16MB per single scrypt operation by default) and the access pattern is random, CPU caches would not help a lot. Caches help much more with bcrypt, where the state is just 4KB and fits in L1 cache completely. scrypt is unimplementable on GPUs for other reasons, mostly that there is just not enough memory: assuming you can utilize ~2GB of video RAM, that would feed 128 workitems or 2 wavefronts, which is quite inadequate to keep even one of the CUs properly utilized and definitely not enough to hide any memory latency.

Wouldn't the exact same restrictions on GPUs apply to KC then? In fact they would pretty much apply to all large parallel architectures.

Quote from: gat3way on June 12, 2012, 09:46:22 AM

You would likely benefit a _lot_ more if you can run several independent SHA256 operations on those X ALU units in parallel.

Just what do you think bitcoin mining is?

You clearly do not understand how computer architecture or dependency resolution works. Waiting on the results of a previous instruction? Just start work on a non-dependent instruction.

The current bitcoin mining hardware is restricted by the number of ALUs. True there might be a point where you need to increase the registers, decoders, scheduling to feed all those ALUs, but that's not until you increase the ALU resources drastically from where they are now.

Bitcoin mining is an embarrassingly parallel problem. There is basically NO work to be done to enable more parallelism besides throwing more ALUs at the problem.

gat3way

sr. member

Activity: 256

Merit: 250

Quote from: AzN1337c0d3r on June 08, 2012, 05:58:52 PM

Actually, litecoin is dependent on cache-speed (see this page), so depending on the cache-organization in Knight's Corner, it might not even end up being faster than a regular CPU.

scrypt is a sequential memory-hard algorithm and depends (mostly) on memory access speed. Since the memory requirements are enormous (16MB per single scrypt operation by default) and the access pattern is random, CPU caches would not help a lot. Caches help much more with bcrypt, where the state is just 4KB and fits in L1 cache completely. scrypt is unimplementable on GPUs for other reasons, mostly that there is just not enough memory: assuming you can utilize ~2GB of video RAM, that would feed 128 workitems or 2 wavefronts, which is quite inadequate to keep even one of the CUs properly utilized and definitely not enough to hide any memory latency.

Quote

SHA256 hashing only needs 1 thing: lots and lots of ALUs. Stuff like AVX, threads, and x86 just introduce structural dependencies that slow down the process.

No, it needs parallelism. Lots of ALUs won't help unless you can feed them with enough work to do. SHA256 round steps are sequential and there are dependencies, thus you won't perform one SHA256 operation X times faster by just throwing in X times more ALUs. You would likely benefit a _lot_ more if you can run several independent SHA256 operations on those X ALU units in parallel.
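A rough way to see the "several independent SHA256 operations" point is the Python sketch below, which uses only the standard library. It hashes the same placeholder header with different nonces, once in a thread pool and once in a plain loop, and gets identical results because no hash depends on any other. The all-zero 76-byte prefix and the function names are hypothetical stand-ins, and Python threads are used only to show independence, not to demonstrate a real speedup.

```python
import hashlib
import struct
from concurrent.futures import ThreadPoolExecutor


def double_sha256(header_prefix: bytes, nonce: int) -> bytes:
    """SHA-256 applied twice to the prefix plus a 4-byte little-endian nonce."""
    data = header_prefix + struct.pack("<I", nonce)
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def hash_batch(header_prefix: bytes, nonces, workers: int = 4):
    """Hash every nonce independently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda n: double_sha256(header_prefix, n), nonces))


prefix = bytes(76)           # placeholder header without the nonce (all zeros)
nonces = list(range(8))

parallel = hash_batch(prefix, nonces)
sequential = [double_sha256(prefix, n) for n in nonces]

print(parallel == sequential)
```

Inside each call the 64 rounds still run one after another. The parallelism comes from the outer loop over nonces, which is the part that maps onto SIMD lanes, wavefronts, or extra cores.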

aqrulesms

sr. member

Activity: 373

Merit: 250

Quote from: phorensic on November 18, 2011, 11:52:42 AM

1 TeraFLOPS. If I remember right from a while ago, somebody mentioned that FP isn't really used for hashing and it can't tell you the speed the CPU/GPU will hash at, right? Still, I think this thing might have some hashing power in it. Could it compete with an AMD GPU?

http://www.dailytech.com/Intel+Shows+22nm+50Core+Knights+Corner+CPU/article23299.htm

Reading that I thought it said 1 TeraHash/s. Almost gave me a heart attack :P

Littleshop

legendary

Activity: 1386

Merit: 1004

Quote from: ElectricMucus on December 05, 2011, 10:55:11 AM

Parallel Architectures can be incredibly powerful and will one day replace anything currently made with CPUs, GPUs and FPGAs.

GPUs are a parallel architecture.

AzN1337c0d3r

full member

Activity: 238

Merit: 100

Quote from: Gabi on November 18, 2011, 02:25:17 PM

But it will be very useful for the so-called "cpu coins" like litecoins and others.

Actually, litecoin is dependent on cache-speed (see this page), so depending on the cache-organization in Knight's Corner, it might not even end up being faster than a regular CPU.

Quote from: deslok on November 19, 2011, 12:50:49 AM

Quote from: Dexter770221 on November 18, 2011, 02:05:30 PM

x86 architecture, so 1MH/s per 1GHz. Let's assume that every core runs at that 1GHz, then you get 50MH/s. A8-3850 Llano with 400 shader processors ("cores") on stock gets 70MH/s.

You forget that 4 threads/core and AVX are both something it can leverage. It has much potential, but until it's available we won't really know. Just look at what Quick Sync did for video transcoding... Game changer? Probably not. Contender? Good odds.

SHA256 hashing only needs 1 thing: lots and lots of ALUs. Stuff like AVX, threads, and x86 just introduce structural dependencies that slow down the process.

ArsenShnurkov

legendary

Activity: 1386

Merit: 1000

Manuals for programmers:

http://software.intel.com/en-us/forums/showthread.php?t=105443

http://semiaccurate.com/2012/08/28/intel-details-knights-corner-architecture-at-long-last/

ElectricMucus

legendary

Activity: 1666

Merit: 1057

Marketing manager - GO MP

Parallel Architectures can be incredibly powerful and will one day replace anything currently made with CPUs, GPUs and FPGAs.

But there is no commercially available product yet which can be utilized for cost-effective bitcoin mining. There are architectures out there which could be one day, but they are made with too large a feature size. With current technology it should be possible to have over 1000 rudimentary processors per watt. 200 per watt is already available, made with an older production process.

I won't tell people which product I am talking about, but let it be said that if it were to be produced in 22 nm it would get over an order of magnitude more performance than it currently has.

deslok

sr. member

Activity: 462

Merit: 250

It's all about the game, and how you play it

Quote from: Dexter770221 on November 18, 2011, 02:05:30 PM

x86 architecture, so 1MH/s per 1GHz. Let's assume that every core runs at that 1GHz, then you get 50MH/s. A8-3850 Llano with 400 shader processors ("cores") on stock gets 70MH/s.

You forget that 4 threads/core and AVX are both something it can leverage. It has much potential, but until it's available we won't really know. Just look at what Quick Sync did for video transcoding... Game changer? Probably not. Contender? Good odds.

Littleshop

legendary

Activity: 1386

Merit: 1004

Quote from: Gabi on November 18, 2011, 02:25:17 PM

Quote

Could it compete with an AMD GPU?

Of course not. It's not a GPU.

But it will be very useful for the so-called "cpu coins" like litecoins and others.

And, more importantly, for scientific things that run badly on GPUs because of their small caches.

Correct. This is a pretty specialized CPU and is not going to be good for most tasks. It has nothing like the number of cores in a GPU, but they are easier to program. With a 1 GHz clock rate it won't make your desktop faster, and it won't make your graphics faster than a standard GPU. But for science and complex formulas that can be broken into threads, it is going to be killer.

And if for some reason CPU coins are still around, it will be the choice of CPU-coin miners and will do in standard CPUs for that job.

Gabi

legendary

Activity: 1148

Merit: 1008

If you want to walk on water, get out of the boat

Quote

Could it compete with an AMD GPU?

Of course not. It's not a GPU.

But it will be very useful for the so-called "cpu coins" like litecoins and others.

And, more importantly, for scientific things that run badly on GPUs because of their small caches.

Dexter770221

legendary

Activity: 1029

Merit: 1000

x86 architecture, so 1MH/s per 1GHz. Let's assume that every core runs at that 1GHz, then you get 50MH/s. A8-3850 Llano with 400 shader processors ("cores") on stock gets 70MH/s.
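Spelled out as a Python sketch, the back-of-the-envelope estimate looks like this. The 1 MH/s per GHz rule of thumb, the 1 GHz clock, and the 70 MH/s Llano figure are the assumptions stated in the post, and the function name is hypothetical. It ignores vector units and multiple threads per core.

```python
def estimate_mhs(cores: int, clock_ghz: float, mhs_per_ghz: float = 1.0) -> float:
    """Naive hash-rate estimate: every core contributes clock * rule-of-thumb rate."""
    return cores * clock_ghz * mhs_per_ghz


knights_corner_mhs = estimate_mhs(cores=50, clock_ghz=1.0)
llano_gpu_mhs = 70.0  # figure quoted in the post for a stock A8-3850

print(f"Knights Corner estimate: {knights_corner_mhs:.0f} MH/s")
print(f"A8-3850 Llano (quoted):  {llano_gpu_mhs:.0f} MH/s")
```

The result is only as good as the per-GHz rule of thumb, which is the part of the estimate the rest of the thread goes on to dispute.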

phorensic

hero member

Activity: 630

Merit: 500

1 TeraFLOPS. If I remember right from a while ago, somebody mentioned that FP isn't really used for hashing and it can't tell you the speed the CPU/GPU will hash at, right? Still, I think this thing might have some hashing power in it. Could it compete with an AMD GPU?

http://www.dailytech.com/Intel+Shows+22nm+50Core+Knights+Corner+CPU/article23299.htm

Topic: Intel 50-core Knights Corner (Read 6026 times)