Pages:
Author

Topic: Xeon Phi (Read 36483 times)

newbie
Activity: 4
Merit: 0
March 27, 2014, 10:48:41 AM
Quote
does it work with scrypt?

sorry, only sha256d is supported at this time.
at first glance, efficient scrypt implementation does not look as easy as sha256d.
full member
Activity: 201
Merit: 100
March 27, 2014, 10:31:40 AM
i ported Pooler's cpuminer to Intel Xeon Phi so it takes advantage of the 512 bits registers and compute 16 hashes at once.
i was able to test is on Intel Xeon Phi series 5100 (60 cores, 240 threads @ 1.053 GHz)

so far, i was able to measure 140 MHash/s (using 240 threads).
it was interesting to note that if using 60 threads (using one thread per core) i was able
to achieve 65 MHash/s, which means there could be some room for optimisation, and 260 MHash/s
could be achieved.

on the other hand, i could only achieve 13.8 MHash/s on 240 threads by using the plain C code
(e.g. no use of 512 bit registers use)

the code (Linux only) can be downloaded at https://github.com/kiyominer/cpuminer

Cheers,

kiyo

Hi, does it work with scrypt? thanks.
newbie
Activity: 4
Merit: 0
March 27, 2014, 09:42:53 AM
i ported Pooler's cpuminer to Intel Xeon Phi so it takes advantage of the 512 bits registers and compute 16 hashes at once.
i was able to test is on Intel Xeon Phi series 5100 (60 cores, 240 threads @ 1.053 GHz)

so far, i was able to measure 140 MHash/s (using 240 threads).
it was interesting to note that if using 60 threads (using one thread per core) i was able
to achieve 65 MHash/s, which means there could be some room for optimisation, and 260 MHash/s
could be achieved.

on the other hand, i could only achieve 13.8 MHash/s on 240 threads by using the plain C code
(e.g. no use of 512 bit registers use)

the code (Linux only) can be downloaded at https://github.com/kiyominer/cpuminer

Cheers,

kiyo
legendary
Activity: 2128
Merit: 1073
November 18, 2012, 07:50:25 AM
Im getting a shot of one next week, has anyone got the code to get it mining? Thanks
pooler's cpuminer will mine both Bitcoins and Litecons. You'll have to recompile to take advantage of the new instructions and massive multithreading. To fully utilize the long vector units you'll probably need to restructure to loops somewhat.

https://bitcointalksearch.org/topic/an-even-more-optimized-version-of-cpuminer-poolers-cpuminer-cpu-only-55038
newbie
Activity: 45
Merit: 0
November 18, 2012, 04:37:30 AM
Im getting a shot of one next week, has anyone got the code to get it mining? Thanks
legendary
Activity: 2128
Merit: 1073
July 10, 2012, 05:00:35 PM
Just an interesting tidbit I've found on the 2nd pass through the Knights Corner documentation:
Quote
EBX[23:16] = 248; // Maximum number of logical processors
248/4 = 62 not 50.

Why could that be?

1) A leftover from Knights Ferry?
2) Yield with all 62 cores enabled would be zero?
3) Something else?

Please share your guesses.
full member
Activity: 238
Merit: 100
★YoBit.Net★ 350+ Coins Exchange & Dice
June 30, 2012, 05:19:40 AM
Thats an unusually strict definition of SMT. Issuing instructions from more than one thread at a time to fill load requirements over multiple ALUs is not a requirement to be SMT.

Dude, why do you think it's called simultaneous multithreading then?

If you have found a less strict definition somewhere from an authoritative source, please do share. All research papers I've read regarding SMT has had that definition.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
June 30, 2012, 05:17:57 AM

I know what SMT is, and it only describes one core doing multiple threads simultaniously. It doesn't describe how, coarse or not. T1s, Intel P4s, Intel i*s with HT, Bulldozers, and Radeons can all be described as SMT. They just don't all do it the same way.

You seriously still not do understand what SMT is. If you have Computer Architecture: A Quantitative Approach, I suggest you go flip to the chapter on multithreading and reading it.

Also look at slide 35 from this Powerpoint presentation on threading.

Ultrasparc T1 cannot possibly be SMT. From wikipedia article on SMT:

Quote
The key factor to distinguish them is to look at how many instructions the processor can issue in one cycle and how many threads from which the instructions come. For example, Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its November 14, 2005 release) is a multicore processor combined with fine-grain multithreading technique instead of simultaneous multithreading because each core can only issue one instruction at a time.

Thats an unusually strict definition of SMT. Issuing instructions from more than one thread at a time to fill load requirements over multiple ALUs is not a requirement to be SMT.
full member
Activity: 238
Merit: 100
★YoBit.Net★ 350+ Coins Exchange & Dice
June 30, 2012, 05:12:47 AM

I know what SMT is, and it only describes one core doing multiple threads simultaniously. It doesn't describe how, coarse or not. T1s, Intel P4s, Intel i*s with HT, Bulldozers, and Radeons can all be described as SMT. They just don't all do it the same way.

You seriously still not do understand what SMT is. If you have Computer Architecture: A Quantitative Approach, I suggest you go flip to the chapter on multithreading and reading it.

Also look at slide 35 from this Powerpoint presentation on threading.

Ultrasparc T1 cannot possibly be SMT. From wikipedia article on SMT:

Quote
The key factor to distinguish them is to look at how many instructions the processor can issue in one cycle and how many threads from which the instructions come. For example, Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its November 14, 2005 release) is a multicore processor combined with fine-grain multithreading technique instead of simultaneous multithreading because each core can only issue one instruction at a time.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
June 30, 2012, 05:00:20 AM
Quote from: DiabloD3
AMD is SMT if you look at it that way, but I look at it much finer grained than SMT: not only do you get SMT, but instructions are scheduled to run on the next free ALU. Fine grained "switch every cycle" would waste resources as ALUs would have no work to run most of the time.

Quote from: DiabloD3
Intel Hyperthreading switches on block on P4s, and I think on i*s that have it they switched to every cycle.

I think you have a fundamental misunderstanding of what SMT is as evidenced by the quotes above. SMT is neither fine-grained nor coarse-grained multithreading.

SMT is simultaneous multithreading so there is no switching. Instructions from all threads are eligible to be issued at any given clock cycle.

Quote from: DiabloD3
I wasn't arguing coarse vs fine. I was just arguing on who does what.

Niagaras really do switch on block, but the newer ones might just switch every cycle now that they have a FP unit per core instead of per socket.

From Wikipedia's UltraSPARCT1 article:

Quote
Each core is a barrel processor, meaning it switches between available threads each cycle.

It doesn't switch on a block, it switches EVERY cycle. You can confirm this if you read Sun's architecture manuals (T1 has been open-sourced)

I know what SMT is, and it only describes one core doing multiple threads simultaniously. It doesn't describe how, coarse or not. T1s, Intel P4s, Intel i*s with HT, Bulldozers, and Radeons can all be described as SMT. They just don't all do it the same way.

As for T1s doing it on block, this is what Sun advertised it as. I'm not surprised their marketing department got it slightly wrong, so I'll let you have that one.
full member
Activity: 238
Merit: 100
★YoBit.Net★ 350+ Coins Exchange & Dice
June 30, 2012, 04:46:32 AM
Quote from: DiabloD3
AMD is SMT if you look at it that way, but I look at it much finer grained than SMT: not only do you get SMT, but instructions are scheduled to run on the next free ALU. Fine grained "switch every cycle" would waste resources as ALUs would have no work to run most of the time.

Quote from: DiabloD3
Intel Hyperthreading switches on block on P4s, and I think on i*s that have it they switched to every cycle.

I think you have a fundamental misunderstanding of what SMT is as evidenced by the quotes above. SMT is neither fine-grained nor coarse-grained multithreading.

SMT is simultaneous multithreading so there is no switching. Instructions from all threads are eligible to be issued at any given clock cycle.

Quote from: DiabloD3
I wasn't arguing coarse vs fine. I was just arguing on who does what.

Niagaras really do switch on block, but the newer ones might just switch every cycle now that they have a FP unit per core instead of per socket.

From Wikipedia's UltraSPARCT1 article:

Quote
Each core is a barrel processor, meaning it switches between available threads each cycle.

It doesn't switch on a block, it switches EVERY cycle. You can confirm this if you read Sun's architecture manuals (T1 has been open-sourced)
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
June 29, 2012, 09:42:14 PM
switch to the next on anything that would block execution.
That's the definition of coarse-grained multithreading and I believe you're mistaken. No major processor architecture implements coarse-grained multithreading.

The Ultrasparc T1/T2 switches thread on every cycle, which is the definition of fine-grained multithreading.

The IBM Power series have implemented true SMT since Power5.

AMD Bulldozers do the same, but have a semi-unified scheduler that schedules the next instruction (from one of two already decoded streams) onto the next ALU (2 threads -> 4 integer ALUs and 2 FP ALUs).

AMD's implementation is actually SMT if you only regard the int ALU resources. The innovation they made is that they share FP resources across two cores (or they have dedicated integer resources for each core if you want to look at it that way).

Quote
Radeons, however, don't switch on block. They automatically assume there is memory latency and results won't be available for four pipeline executions later. It makes the hardware simpler and easier to design compilers for. I suspect Knights Corner is more like a Radeon than a Niagara in this case.

That's fine-grained multithreading then because it's basically switching to a new thread every cycle.

I wasn't arguing coarse vs fine. I was just arguing on who does what.

Niagaras really do switch on block, but the newer ones might just switch every cycle now that they have a FP unit per core instead of per socket.

Radeons switch every VLIW clause (which can be up to 128 instructions long) due to the unique register layout.

AMD is SMT if you look at it that way, but I look at it much finer grained than SMT: not only do you get SMT, but instructions are scheduled to run on the next free ALU. Fine grained "switch every cycle" would waste resources as ALUs would have no work to run most of the time.

Intel Hyperthreading switches on block on P4s, and I think on i*s that have it they switched to every cycle.

I mean, what I'm trying to say is, all modern high performance archs use the same set of tricks, but its at what level do they exploit them. Its rumored that POWER in the future (9? 10?) will just be something like 64 threads piping into one core, where that one core has like 128 int ALUs and a similar number of FPUs, which is probably the future of non-lockstep highly parallel programming.
full member
Activity: 238
Merit: 100
★YoBit.Net★ 350+ Coins Exchange & Dice
June 29, 2012, 07:14:22 PM
switch to the next on anything that would block execution.
That's the definition of coarse-grained multithreading and I believe you're mistaken. No major processor architecture implements coarse-grained multithreading.

The Ultrasparc T1/T2 switches thread on every cycle, which is the definition of fine-grained multithreading.

The IBM Power series have implemented true SMT since Power5.

AMD Bulldozers do the same, but have a semi-unified scheduler that schedules the next instruction (from one of two already decoded streams) onto the next ALU (2 threads -> 4 integer ALUs and 2 FP ALUs).

AMD's implementation is actually SMT if you only regard the int ALU resources. The innovation they made is that they share FP resources across two cores (or they have dedicated integer resources for each core if you want to look at it that way).

Quote
Radeons, however, don't switch on block. They automatically assume there is memory latency and results won't be available for four pipeline executions later. It makes the hardware simpler and easier to design compilers for. I suspect Knights Corner is more like a Radeon than a Niagara in this case.

That's fine-grained multithreading then because it's basically switching to a new thread every cycle.
legendary
Activity: 2128
Merit: 1073
June 29, 2012, 03:06:03 PM
Its only hyperthreaded? Thats kinda pointless altogether.
It is both superscalar (2-way) and hyperthreaded (4-way).

Another quote from the same manual:
Mining is an embarrassingly parallel workload, so an implementation can be adjusted to fully exploit the ALU resources of Xeon Phi,
but hyperthreading should make the "adjustment" work easier. The threads will not be fighting for cache lines, which is the most common cause for not gaining the performance in hyperthreaded processors.

Anyway, we'll see. I'm not really up to downloading Intel compilers, compiling the code and analyzing the assembly. Maybe in winter?
mrb
legendary
Activity: 1512
Merit: 1028
June 29, 2012, 01:54:06 AM
Its only hyperthreaded? Thats kinda pointless altogether.

I use this term liberally. I have no idea what type of threading Xeon Phi will implement (the GPU way: switching unconditionally to the next thread on each instruction; or the CPU way: switching to the next thread when the current one would wait on memory).

But either way, it does not matter to us. Mining is an embarrassingly parallel workload, so an implementation can be adjusted to fully exploit the ALU resources of Xeon Phi, and whatever type of threading Xeon Phi implement will not add supplemental performance.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
June 29, 2012, 01:29:12 AM
I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.
I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's  document number: 327357-001

Quote
2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Xeon Phi is hyperthreaded (the vendor-neutral term for this is SMT = symmetric multithreading), but as I am sure you know SMT does not increase the performance at all of ALU-bound workloads. Therefore we can ignore SMT when making theoretical estimations of the performance of bitcoin mining.


Its only hyperthreaded? Thats kinda pointless altogether.
mrb
legendary
Activity: 1512
Merit: 1028
June 29, 2012, 01:25:11 AM
I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.
I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's  document number: 327357-001

Quote
2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Xeon Phi is hyperthreaded (the vendor-neutral term for this is SMT = symmetric multithreading), but as I am sure you know SMT does not increase the performance at all of ALU-bound workloads. Therefore we can ignore SMT when making theoretical estimations of the performance of bitcoin mining.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
June 29, 2012, 12:57:05 AM
I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.
I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's  document number: 327357-001

Quote
2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Yeah, but I suspect thats the Radeon trick: have a pipeline 4 issue deep, so memory latency is effectively hid. It probably cant switch on demand.

I remember reading somewhere that MIC was supposed to be barrel-threaded (ie. fine-grained multithreading), somewhat akin to Ultrasparc T1. I can't find the source now though.

Thats almost the same trick. Ultrasparc T[1-4]s and newer IBM POWERs have multiple thread decoders, and switch to the next on anything that would block execution. AMD Bulldozers do the same, but have a semi-unified scheduler that schedules the next instruction (from one of two already decoded streams) onto the next ALU (2 threads -> 4 integer ALUs and 2 FP ALUs).

Radeons, however, don't switch on block. They automatically assume there is memory latency and results won't be available for four pipeline executions later. It makes the hardware simpler and easier to design compilers for. I suspect Knights Corner is more like a Radeon than a Niagara in this case.
full member
Activity: 238
Merit: 100
★YoBit.Net★ 350+ Coins Exchange & Dice
June 29, 2012, 12:39:48 AM
I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.
I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's  document number: 327357-001

Quote
2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Yeah, but I suspect thats the Radeon trick: have a pipeline 4 issue deep, so memory latency is effectively hid. It probably cant switch on demand.

I remember reading somewhere that MIC was supposed to be barrel-threaded (ie. fine-grained multithreading), somewhat akin to Ultrasparc T1. I can't find the source now though.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
June 28, 2012, 11:59:50 PM
I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.
I think the confusion runs deeper that just me.

Here's the quote from the "Knights Corner Performance Monitoring Units";
Intel's  document number: 327357-001

Quote
2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.

Yeah, but I suspect thats the Radeon trick: have a pipeline 4 issue deep, so memory latency is effectively hid. It probably cant switch on demand.
Pages:
Jump to: