
Topic: Xeon Phi - page 2.

legendary
Activity: 2128
Merit: 1073
June 28, 2012, 10:11:08 PM
I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.
I think the confusion runs deeper than just me.

Here's the quote from "Knights Corner Performance Monitoring Units",
Intel document number 327357-001:

Quote
2. 4-Way Threaded: Each Knights Corner core is able to process 4 threads concurrently.
mrb
legendary
Activity: 1512
Merit: 1028
June 28, 2012, 08:38:53 PM
#99
I wonder how this number is going to change once we include the information that the basic core resembles the Pentium, which was dual-pipeline, and that the cores are now 4-way hyperthreaded.

Sorry, I got a little confused.

I see why you are confused. 2112 meant superscalar, not hyperthreaded, as pointed out by others.
legendary
Activity: 1148
Merit: 1008
If you want to walk on water, get out of the boat
June 28, 2012, 02:22:07 PM
#98
I remember reading that the Xeon Phi will run a version of Linux, and from it you can run things.

Get a computer with 3-4 of these and run BOINC on them. Epic computing power @Home  Cheesy
legendary
Activity: 2128
Merit: 1073
June 28, 2012, 05:36:27 AM
#97
Which means it'll actually boot an OS. Don't know if you can boot your own, though. Can't imagine why you wouldn't be able to.
Yeah, after further thought I now assume that calling it a co-processor is just artificial market segmentation. Intel probably has an agreement with Cray, SGI, etc. to let them announce their supercomputers as the first standalone systems using Xeon Phi. Then maybe later the second-tier vendors like Microway will announce single/dual/quad Xeon Phi workstations.

This is very clearly a product targeted at the OpenMP market.
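
To illustrate what that means in practice (a minimal sketch, not from any Intel doc; the native-compilation flags mentioned in the comment are my guess, not verified): ordinary OpenMP loops like the one below should, in principle, just recompile for the card.

Code:
/* Toy SAXPY with OpenMP -- illustration only.  Build flags such as
 * "icc -mmic -openmp saxpy.c" are an assumption about the native
 * toolchain, not something I've verified. */
#include <stdio.h>
#include <omp.h>

#define N (1 << 20)

static float x[N], y[N];

int main(void)
{
    float a = 2.0f;

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* The many-core card is just another OpenMP target from the
     * programmer's point of view: one chunk of iterations per thread. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("threads available: %d, y[0] = %f\n", omp_get_max_threads(), y[0]);
    return 0;
}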
full member
Activity: 238
Merit: 100
★YoBit.Net★ 350+ Coins Exchange & Dice
June 27, 2012, 11:15:08 PM
#96
Well, the real problem is, they want to be able to boot existing x86 code on it. Not merely run, but boot.

According to Anandtech:

Quote
Meanwhile on the software side of things in an interesting move Intel is going to be equipping Xeon Phi co-processors with their own OS, in effect making them stand-alone computers (despite the co-processor designation) and significantly deviating from what we’ve seen on similar products (i.e. Tesla). Xeon Phis will be independently running an embedded form of Linux, which Intel has said will be of particular benefit for cluster users. Drivers of course will still be necessary for a host device to interface with the co-processor, with the implication being that these drivers will be fairly thin and simple since the co-processor itself is already running a full OS.

Which means it'll actually boot an OS. Don't know if you can boot your own, though. Can't imagine why you wouldn't be able to.
legendary
Activity: 2128
Merit: 1073
June 27, 2012, 08:50:48 PM
#95
Well, the real problem is, they want to be able to boot existing x86 code on it. Not merely run, but boot.
Well, I was thinking of a coprocessor as something directly accessible through QuickPath that doesn't require an OS at all, for example what AMD does to support FPGAs in Opteron sockets over HyperTransport. Such a co-processor wouldn't need to boot in the classic OS sense; rather it would need to support "reset" without resetting the neighboring CPU.

I'm thinking they're this: Atom-like cores, dual-issue, in-order execution, no x87 FPU, and a 512-bit SIMD unit that does both integer and fp, 32? KB of L1, and a small amount of L2.

Now, granted, that sounds shitty, but if I can run normal threads on those instead of lockstep thread clusters and the SIMD units support booleans (512 of them at a time) or chars (64 at a time), this could actually end up with surprisingly fast mining.
I think Knights Corner has granted your wishes, mostly. There's still support for legacy FP, but the XMM & YMM registers are replaced by ZMM. There's no support for chars, but there is for Int32 and Int64. If you were thinking of a bit-slice parallel implementation for a miner, then those Int* types will allow that. Multiprocessing and multithreading are all compliant with OpenMP.
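
To illustrate what bit-slicing buys you here (a toy sketch only, plain C with 64-bit words standing in for the 512-bit ZMM Int64 ops; untested, names made up):

Code:
/* Toy bit-slice sketch -- illustration only, not tested on KNC.
 * 64 independent hash instances are stored "sideways": bit k of word j
 * holds bit j of instance k.  Each SHA-256 boolean function then turns
 * into plain word-wide logic; on Knights Corner the same idea would run
 * on the 512-bit Int32/Int64 vector ops, giving 512 lanes instead of 64. */
#include <stdint.h>

typedef uint64_t slice_t;   /* 64 lanes per word; a ZMM register would give 512 */

/* SHA-256 Ch and Maj evaluated for all 64 lanes at once. */
static inline slice_t ch(slice_t x, slice_t y, slice_t z)
{
    return (x & y) ^ (~x & z);
}

static inline slice_t maj(slice_t x, slice_t y, slice_t z)
{
    return (x & y) ^ (x & z) ^ (y & z);
}

/* Bit-sliced 32-bit addition: an explicit ripple carry across the 32
 * bit planes.  This is where the bit-slice approach pays its price. */
static void add_sliced(const slice_t x[32], const slice_t y[32], slice_t out[32])
{
    slice_t carry = 0;
    for (int i = 0; i < 32; i++) {          /* plane 0 = least significant bit */
        slice_t s = x[i] ^ y[i];
        out[i] = s ^ carry;
        carry  = (x[i] & y[i]) | (s & carry);
    }
}

Rotations become free (you just re-index the bit planes), and OpenMP would simply farm out independent groups of 64 (or 512) instances to the cores.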

The docs for the architecture are near the bottom of this page:

http://software.intel.com/en-us/forums/showthread.php?t=105443

The instruction set is supported by the recent Intel C/C++ and Fortran compilers. The GNU port was just to compile the Linux kernel and doesn't really support the new instructions.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
June 27, 2012, 05:18:04 PM
#94
It is unclear whether Xeon Phi can dual-issue LRBni (512-bit) instructions (and has enough execution units to execute 2 per cycle), or can only do it for x86-64 (32/64-bit) instructions. I assumed the former, hence my 280 Mhash/s estimate. If not, performance would be 560 Mhash/s as 2112 pointed out.
I agree that unclear is the operative word. I have a feeling that Intel's James Reinders is heavily under the influence of the marketing department. There's some talk that the current board is a coprocessor, yet the ISA manual clearly shows the unit booting in 16-bit segmented real mode.

Intel is either doing artificial market segmentation or something didn't work out in the memory controller/quickpath/chipset interface portion of the design.

I also wonder if the references to the "original Pentium" are similar to the branding exercise that happened with the announcement of the original Atoms. The Atoms had a completely redesigned microarchitecture called Bonnell, but with various features disabled. Yet the marketing described them as a "reissue of the classic Pentium", while pretty much the only thing they had in common was the lack of deep speculation and in-order execution.

Well, the real problem is, they want to be able to boot existing x86 code on it. Not merely run, but boot.

I'm thinking they're this: Atom-like cores, dual-issue, in-order execution, no x87 FPU, and a 512-bit SIMD unit that does both integer and fp, 32? KB of L1, and a small amount of L2.

Now, granted, that sounds shitty, but if I can run normal threads on those instead of lockstep thread clusters and the SIMD units support booleans (512 of them at a time) or chars (64 at a time), this could actually end up with surprisingly fast mining.
legendary
Activity: 2128
Merit: 1073
June 27, 2012, 04:52:12 PM
#93
It is unclear whether Xeon Phi can dual-issue LRBni (512-bit) instructions (and has enough execution units to execute 2 per cycle), or can only do it for x86-64 (32/64-bit) instructions. I assumed the former, hence my 280 Mhash/s estimate. If not, performance would be 560 Mhash/s as 2112 pointed out.
I agree that unclear is the operative word. I have a feeling that Intel's James Reinders is heavily under the influence of the marketing department. There's some talk that the current board is a coprocessor, yet the ISA manual clearly shows the unit booting in 16-bit segmented real mode.

Intel is either doing artificial market segmentation or something didn't work out in the memory controller/quickpath/chipset interface portion of the design.

I also wonder if the references to the "original Pentium" are similar to the branding exercise that happened with the announcement of the original Atoms. The Atoms had a completely redesigned microarchitecture called Bonnell, but with various features disabled. Yet the marketing described them as a "reissue of the classic Pentium", while pretty much the only thing they had in common was the lack of deep speculation and in-order execution.
legendary
Activity: 952
Merit: 1000
June 27, 2012, 01:22:25 PM
#92
I wonder how this number is going to change once we include the information that the basic core resembles the Pentium, which was dual-pipeline, and that the cores are now 4-way hyperthreaded.

Sorry, I got a little confused.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
June 27, 2012, 12:41:20 PM
#91
Given all these assumptions, a Xeon Phi card should mine at roughly 280 Mhash/s, or about as fast as a low end HD 7850. Not impressive.
I wonder how this number is going to change once we include the information that the basic core resembles the Pentium, which was dual-pipeline, and that the cores are now 4-way hyperthreaded.

From the Pentium days I remember straightforward Fortran & C code easily retiring more than 1 instruction per clock: 1.3-1.6 with nothing more than "-Ofast".

I would double your number to 560 Mhash/s. It should be a safe assumption that pipeline utilization could get close to 100%.

I thought hyperthreading only gave like a 25% performance boost, tops?

Superscalar is not hyperthreading.
mrb
legendary
Activity: 1512
Merit: 1028
June 27, 2012, 12:33:17 PM
#90
I thought hyperthreading only gave like a 25% performance boost, tops?

2112 is not talking about hyperthreading. He is talking about the U and V pipelines of the original Pentium CPU, where a single core, a single thread, can execute up to 2 instructions per clock.

It is unclear whether Xeon Phi can dual-issue LRBni (512-bit) instructions (and has enough execution units to execute 2 per cycle), or can only do it for x86-64 (32/64-bit) instructions. I assumed the former, hence my 280 Mhash/s estimate. If not, performance would be 560 Mhash/s as 2112 pointed out.
legendary
Activity: 952
Merit: 1000
June 27, 2012, 12:04:08 PM
#89
Given all these assumptions, a Xeon Phi card should mine at roughly 280 Mhash/s, or about as fast as a low end HD 7850. Not impressive.
I wonder how this number is going to change once we include the information that the basic core resembles the Pentium, which was dual-pipeline, and that the cores are now 4-way hyperthreaded.

From the Pentium days I remember straightforward Fortran & C code easily retiring more than 1 instruction per clock: 1.3-1.6 with nothing more than "-Ofast".

I would double your number to 560 Mhash/s. It should be a safe assumption that pipeline utilization could get close to 100%.

I thought hyperthreading only gave like a 25% performance boost, tops?
legendary
Activity: 2128
Merit: 1073
June 27, 2012, 12:00:34 PM
#88
Given all these assumptions, a Xeon Phi card should mine at roughly 280 Mhash/s, or about as fast as a low end HD 7850. Not impressive.
I wonder how this number is going to change once we include the information that the basic core resembles the Pentium, which was dual-pipeline, and that the cores are now 4-way hyperthreaded.

From the Pentium days I remember straightforward Fortran & C code easily retiring more than 1 instruction per clock: 1.3-1.6 with nothing more than "-Ofast".

I would double your number to 560 Mhash/s. It should be a safe assumption that pipeline utilization could get close to 100%.
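
Just to show how this kind of estimate hangs together (every number below is a guess on my part, not an Intel spec: the core count, clock, and ops-per-hash figure are all assumptions):

Code:
/* Back-of-envelope only -- all parameters are guesses, not specs. */
#include <stdio.h>

int main(void)
{
    double cores        = 60;       /* assumed core count                    */
    double clock_hz     = 1.0e9;    /* assumed ~1 GHz                        */
    double lanes        = 16;       /* 512-bit SIMD / 32-bit words           */
    double ops_per_hash = 3500;     /* rough 32-bit ops per double SHA-256   */

    double single_pipe = cores * clock_hz * lanes / ops_per_hash;
    double both_pipes  = 2.0 * single_pipe;   /* Pentium-style U+V pipes both busy */

    printf("one pipe:   ~%.0f Mhash/s\n", single_pipe / 1e6);  /* ~270-280 */
    printf("both pipes: ~%.0f Mhash/s\n", both_pipes  / 1e6);  /* ~550-560 */
    return 0;
}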
full member
Activity: 238
Merit: 100
★YoBit.Net★ 350+ Coins Exchange & Dice
June 22, 2012, 07:34:20 PM
#87
Counting memory chips isn't going to tell you if a graphics board is ECC capable.

The Quadro 6000s I have at work have the traditional 384-bit memory bus (6x64 bits) found in the GTX480.
member
Activity: 66
Merit: 10
June 22, 2012, 06:40:26 PM
#86
I was also under the impression the 7970 did have ECC, but I thought it was not used (it would cost performance?). My impression is based on some articles and postings such as the two below.

I read this sometime back on: http://www.anandtech.com/Show/Index/4455?cPage=4&all=False&sort=0&page=6&slug=amds-graphics-core-next-preview-amd-architects-for-compute

Quote
Finally on the memory side, AMD is adding proper ECC support to supplement their existing EDC (Error Detection & Correction) functionality, which is used to ensure the integrity of memory transmissions across the GDDR5 memory bus. Both the SRAM and VRAM memory can be ECC protected. For the SRAM this is a free operation, while for the VRAM there will be a performance overhead. We’re assuming that AMD will be using a virtual ECC scheme like NVIDIA, where ECC data is distributed across VRAM rather than using extra memory chips/controllers.

Shamino did some LN2 overclocking when the 7970 was released; in his forum he wrote of the 7970:

Quote
actually 1800 ram is easy, i ran 2000 ram and it got the ECC correction and the score was worse.

http://kingpincooling.com/forum/showthread.php?t=1559

ECC on external RAM like that is done by adding more chips. If these were DIMMs, you'd have DIMMs with 9 chips instead of 8.

The easy way to figure this out is if someone finds a picture of the ref board naked and counts the chips.

But would counting the number of chips tell whether the GPU supports ECC? On DIMMs there may be extra memory chips to spread the bits evenly through chipkill (e.g. a 13-bit word = 8 data bits and 5 parity bits, spread across 13 DRAM chips). The 7970 most likely has a virtual scheme to implement ECC, via a BCH or Hamming code for example; the Anandtech article I posted earlier noted that it is probably a virtual ECC implementation. DIMMs with 9 chips do parity on 1 bit; multiple ECC DIMMs can do multiple parity bits across DIMMs. As for the 7970, relying on a virtual ECC implementation allows multi-bit errors to be detected/corrected more conveniently and cheaply.

I really doubt they're doing that; it would add too much complexity to the memory controllers. So, yes, find a 7970 photo, count the memory chips, and post the number in here.

Nah, I'm sure you can search for a ref board and count RAM chips yourself. Besides, it is a new architecture, so why not implement ECC? ECC does not mean adding more RAM chips, as I mentioned before. There are a number of articles out there that refer to ECC on the 7970, but yes, I'm sure you can search for them if you want to find out.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
June 22, 2012, 06:23:16 PM
#85
I was also under the impression the 7970 did have ECC, but I thought it was not used (it would cost performance?). My impression is based on some articles and postings such as the two below.

I read this sometime back on: http://www.anandtech.com/Show/Index/4455?cPage=4&all=False&sort=0&page=6&slug=amds-graphics-core-next-preview-amd-architects-for-compute

Quote
Finally on the memory side, AMD is adding proper ECC support to supplement their existing EDC (Error Detection & Correction) functionality, which is used to ensure the integrity of memory transmissions across the GDDR5 memory bus. Both the SRAM and VRAM memory can be ECC protected. For the SRAM this is a free operation, while for the VRAM there will be a performance overhead. We’re assuming that AMD will be using a virtual ECC scheme like NVIDIA, where ECC data is distributed across VRAM rather than using extra memory chips/controllers.

Shamino did some LN2 overclocking when the 7970 was released; in his forum he wrote of the 7970:

Quote
actually 1800 ram is easy, i ran 2000 ram and it got the ECC correction and the score was worse.

http://kingpincooling.com/forum/showthread.php?t=1559

ECC on external RAM like that is done by adding more chips. If these were DIMMs, you'd have DIMMs with 9 chips instead of 8.

The easy way to figure this out is if someone finds a picture of the ref board naked and counts the chips.

But would counting the number of chips tell whether the GPU supports ECC? On DIMMs there may be extra memory chips to spread the bits evenly through chipkill (e.g. a 13-bit word = 8 data bits and 5 parity bits, spread across 13 DRAM chips). The 7970 most likely has a virtual scheme to implement ECC, via a BCH or Hamming code for example; the Anandtech article I posted earlier noted that it is probably a virtual ECC implementation. DIMMs with 9 chips do parity on 1 bit; multiple ECC DIMMs can do multiple parity bits across DIMMs. As for the 7970, relying on a virtual ECC implementation allows multi-bit errors to be detected/corrected more conveniently and cheaply.

I really doubt they're doing that; it would add too much complexity to the memory controllers. So, yes, find a 7970 photo, count the memory chips, and post the number in here.
member
Activity: 66
Merit: 10
June 22, 2012, 05:36:25 PM
#84
I was also under the impression the 7970 did have ECC, but I thought it was not used (it would cost performance?). My impression is based on some articles and postings such as the two below.

I read this sometime back on: http://www.anandtech.com/Show/Index/4455?cPage=4&all=False&sort=0&page=6&slug=amds-graphics-core-next-preview-amd-architects-for-compute

Quote
Finally on the memory side, AMD is adding proper ECC support to supplement their existing EDC (Error Detection & Correction) functionality, which is used to ensure the integrity of memory transmissions across the GDDR5 memory bus. Both the SRAM and VRAM memory can be ECC protected. For the SRAM this is a free operation, while for the VRAM there will be a performance overhead. We’re assuming that AMD will be using a virtual ECC scheme like NVIDIA, where ECC data is distributed across VRAM rather than using extra memory chips/controllers.

Shamino did some LN2 overclocking when the 7970 was released; in his forum he wrote of the 7970:

Quote
actually 1800 ram is easy, i ran 2000 ram and it got the ECC correction and the score was worse.

http://kingpincooling.com/forum/showthread.php?t=1559

ECC on external RAM like that is done by adding more chips. If these were DIMMs, you'd have DIMMs with 9 chips instead of 8.

The easy way to figure this out is if someone finds a picture of the ref board naked and counts the chips.

But would counting the number of chips tell whether the GPU supports ECC? On DIMMs there may be extra memory chips to spread the bits evenly through chipkill (e.g. a 13-bit word = 8 data bits and 5 parity bits, spread across 13 DRAM chips). The 7970 most likely has a virtual scheme to implement ECC, via a BCH or Hamming code for example; the Anandtech article I posted earlier noted that it is probably a virtual ECC implementation. DIMMs with 9 chips do parity on 1 bit; multiple ECC DIMMs can do multiple parity bits across DIMMs. As for the 7970, relying on a virtual ECC implementation allows multi-bit errors to be detected/corrected more conveniently and cheaply.
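
To make the 8-data-bit + 5-check-bit arithmetic concrete, here is a toy SEC-DED encoder (illustration only; real memory controllers use wider codes such as 72/64 and their own bit layouts):

Code:
/* Toy SEC-DED (single-error-correct, double-error-detect) encoder for
 * one byte: 8 data bits + 4 Hamming check bits + 1 overall parity bit
 * = the 13-bit word mentioned above.  Illustration only. */
#include <stdint.h>
#include <stdio.h>

static uint16_t secded_encode(uint8_t data)
{
    /* Data bits go to Hamming positions 3,5,6,7,9,10,11,12;
     * positions 1,2,4,8 are reserved for check bits. */
    uint16_t w = 0;
    const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};
    for (int i = 0; i < 8; i++)
        if (data & (1u << i))
            w |= 1u << data_pos[i];

    /* Check bit at position p covers every position whose index has
     * bit p set (classic Hamming rule, even parity). */
    for (int p = 1; p <= 8; p <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if ((pos & p) && (w & (1u << pos)))
                parity ^= 1;
        if (parity)
            w |= 1u << p;
    }

    /* Overall parity over the 12 Hamming bits goes into bit 0,
     * giving 13 bits total and double-error detection. */
    int overall = 0;
    for (int pos = 1; pos <= 12; pos++)
        if (w & (1u << pos))
            overall ^= 1;
    if (overall)
        w |= 1u;

    return w;               /* 13 significant bits */
}

int main(void)
{
    printf("0x%02x -> 0x%04x\n", 0xA5, (unsigned)secded_encode(0xA5));
    return 0;
}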
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
June 22, 2012, 04:27:53 PM
#83
Protip: Xeon Phi runs x86 code

Good luck using the 7970's (or nvidia's) computing power, having to fight with OpenCL and CUDA.

Highly unlikely Phi is going to run x86 code unmodified.

Also, OpenCL isn't bad; it's making things "parallel" that is the hard part of any problem.

Actually, it might run x86 code unmodified. That just isn't the best way to get performance on those machines.
legendary
Activity: 1162
Merit: 1000
DiabloMiner author
June 22, 2012, 04:25:44 PM
#82
I was also under the impression the 7970 did have ECC, but I thought it was not used (it would cost performance?). My impression is based on some articles and postings such as the two below.

I read this sometime back on: http://www.anandtech.com/Show/Index/4455?cPage=4&all=False&sort=0&page=6&slug=amds-graphics-core-next-preview-amd-architects-for-compute

Quote
Finally on the memory side, AMD is adding proper ECC support to supplement their existing EDC (Error Detection & Correction) functionality, which is used to ensure the integrity of memory transmissions across the GDDR5 memory bus. Both the SRAM and VRAM memory can be ECC protected. For the SRAM this is a free operation, while for the VRAM there will be a performance overhead. We’re assuming that AMD will be using a virtual ECC scheme like NVIDIA, where ECC data is distributed across VRAM rather than using extra memory chips/controllers.

Shamino did some LN2 overclocking when the 7970 was released; in his forum he wrote of the 7970:

Quote
actually 1800 ram is easy, i ran 2000 ram and it got the ECC correction and the score was worse.

http://kingpincooling.com/forum/showthread.php?t=1559

ECC on external RAM like that is done by adding more chips. If these were DIMMs, you'd have DIMMs with 9 chips instead of 8.

The easy way to figure this out is if someone finds a picture of the ref board naked and counts the chips.
full member
Activity: 238
Merit: 100
★YoBit.Net★ 350+ Coins Exchange & Dice
June 22, 2012, 03:53:50 PM
#81
Protip: Xeon Phi runs x86 code

Good luck using the 7970's (or nvidia's) computing power, having to fight with OpenCL and CUDA.

Highly unlikely Phi is going to run x86 code unmodified.

Also, OpenCL isn't bad; it's making things "parallel" that is the hard part of any problem.