Topic: Xeon Phi (Read 36483 times)

switch to the next on anything that would block execution.

That's the definition of coarse-grained multithreading and I believe you're mistaken. No major processor architecture implements coarse-grained multithreading.

The Ultrasparc T1/T2 switches thread on every cycle, which is the definition of fine-grained multithreading.

The IBM Power series have implemented true SMT since Power5.

AMD Bulldozers do the same, but have a semi-unified scheduler that schedules the next instruction (from one of two already decoded streams) onto the next ALU (2 threads -> 4 integer ALUs and 2 FP ALUs).

AMD's implementation is actually SMT if you only regard the int ALU resources. The innovation they made is that they share FP resources across two cores (or they have dedicated integer resources for each core if you want to look at it that way).

Quote

Radeons, however, don't switch on block. They automatically assume there is memory latency and results won't be available for four pipeline executions later. It makes the hardware simpler and easier to design compilers for. I suspect Knights Corner is more like a Radeon than a Niagara in this case.

That's fine-grained multithreading then because it's basically switching to a new thread every cycle.

I wasn't arguing coarse vs fine. I was just arguing on who does what.

Niagaras really do switch on block, but the newer ones might just switch every cycle now that they have a FP unit per core instead of per socket.

Radeons switch every VLIW clause (which can be up to 128 instructions long) due to the unique register layout.

AMD is SMT if you look at it that way, but I look at it much finer grained than SMT: not only do you get SMT, but instructions are scheduled to run on the next free ALU. Fine grained "switch every cycle" would waste resources as ALUs would have no work to run most of the time.

Intel Hyperthreading switches on block on P4s, and I think on i*s that have it they switched to every cycle.

I mean, what I'm trying to say is, all modern high performance archs use the same set of tricks, but its at what level do they exploit them. Its rumored that POWER in the future (9? 10?) will just be something like 64 threads piping into one core, where that one core has like 128 int ALUs and a similar number of FPUs, which is probably the future of non-lockstep highly parallel programming.

AzN1337c0d3r

full member

Activity: 238

Merit: 100

★YoBit.Net★ 350+ Coins Exchange & Dice

switch to the next on anything that would block execution.

That's the definition of coarse-grained multithreading and I believe you're mistaken. No major processor architecture implements coarse-grained multithreading.

The Ultrasparc T1/T2 switches thread on every cycle, which is the definition of fine-grained multithreading.

The IBM Power series have implemented true SMT since Power5.

Quote from: DiabloD3 on June 29, 2012, 01:29:12 AM

AMD Bulldozers do the same, but have a semi-unified scheduler that schedules the next instruction (from one of two already decoded streams) onto the next ALU (2 threads -> 4 integer ALUs and 2 FP ALUs).

AMD's implementation is actually SMT if you only regard the int ALU resources. The innovation they made is that they share FP resources across two cores (or they have dedicated integer resources for each core if you want to look at it that way).

Quote

Radeons, however, don't switch on block. They automatically assume there is memory latency and results won't be available for four pipeline executions later. It makes the hardware simpler and easier to design compilers for. I suspect Knights Corner is more like a Radeon than a Niagara in this case.

That's fine-grained multithreading then because it's basically switching to a new thread every cycle.

2112

legendary

Activity: 2128

Merit: 1073

Its only hyperthreaded? Thats kinda pointless altogether.

It is both superscalar (2-way) and hyperthreaded (4-way).

Another quote from the same manual:

Quote from: mrb on June 29, 2012, 01:54:06 AM

Mining is an embarrassingly parallel workload, so an implementation can be adjusted to fully exploit the ALU resources of Xeon Phi,

but hyperthreading should make the "adjustment" work easier. The threads will not be fighting for cache lines, which is the most common cause for not gaining the performance in hyperthreaded processors.

Anyway, we'll see. I'm not really up to downloading Intel compilers, compiling the code and analyzing the assembly. Maybe in winter?

mrb

legendary

Activity: 1512

Merit: 1028

Quote from: DiabloD3 on June 29, 2012, 01:29:12 AM

Its only hyperthreaded? Thats kinda pointless altogether.

I use this term liberally. I have no idea what type of threading Xeon Phi will implement (the GPU way: switching unconditionally to the next thread on each instruction; or the CPU way: switching to the next thread when the current one would wait on memory).

But either way, it does not matter to us. Mining is an embarrassingly parallel workload, so an implementation can be adjusted to fully exploit the ALU resources of Xeon Phi, and whatever type of threading Xeon Phi implement will not add supplemental performance.

DiabloD3

legendary

Activity: 1162

Merit: 1000

DiabloMiner author

Quote from: mrb on June 29, 2012, 01:25:11 AM

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.

I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's document number: 327357-001

Quote

2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Xeon Phi is hyperthreaded (the vendor-neutral term for this is SMT = symmetric multithreading), but as I am sure you know SMT does not increase the performance at all of ALU-bound workloads. Therefore we can ignore SMT when making theoretical estimations of the performance of bitcoin mining.

Its only hyperthreaded? Thats kinda pointless altogether.

mrb

legendary

Activity: 1512

Merit: 1028

Quote from: AzN1337c0d3r on June 29, 2012, 12:39:48 AM

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.

I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's document number: 327357-001

Quote

2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Xeon Phi is hyperthreaded (the vendor-neutral term for this is SMT = symmetric multithreading), but as I am sure you know SMT does not increase the performance at all of ALU-bound workloads. Therefore we can ignore SMT when making theoretical estimations of the performance of bitcoin mining.

DiabloD3

legendary

Activity: 1162

Merit: 1000

DiabloMiner author

Quote from: DiabloD3 on June 28, 2012, 11:59:50 PM

Quote from: DiabloD3 on June 28, 2012, 11:59:50 PM

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.

I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's document number: 327357-001

Quote

2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Yeah, but I suspect thats the Radeon trick: have a pipeline 4 issue deep, so memory latency is effectively hid. It probably cant switch on demand.

I remember reading somewhere that MIC was supposed to be barrel-threaded (ie. fine-grained multithreading), somewhat akin to Ultrasparc T1. I can't find the source now though.

Thats almost the same trick. Ultrasparc T[1-4]s and newer IBM POWERs have multiple thread decoders, and switch to the next on anything that would block execution. AMD Bulldozers do the same, but have a semi-unified scheduler that schedules the next instruction (from one of two already decoded streams) onto the next ALU (2 threads -> 4 integer ALUs and 2 FP ALUs).

Radeons, however, don't switch on block. They automatically assume there is memory latency and results won't be available for four pipeline executions later. It makes the hardware simpler and easier to design compilers for. I suspect Knights Corner is more like a Radeon than a Niagara in this case.

AzN1337c0d3r

full member

Activity: 238

Merit: 100

★YoBit.Net★ 350+ Coins Exchange & Dice

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.

I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's document number: 327357-001

Quote

2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Yeah, but I suspect thats the Radeon trick: have a pipeline 4 issue deep, so memory latency is effectively hid. It probably cant switch on demand.

I remember reading somewhere that MIC was supposed to be barrel-threaded (ie. fine-grained multithreading), somewhat akin to Ultrasparc T1. I can't find the source now though.

DiabloD3

legendary

Activity: 1162

Merit: 1000

DiabloMiner author