
Topic: Cell/B.E. LTC Superminer Project (Read 1718 times)

newbie
Activity: 53
Merit: 0
November 17, 2013, 04:18:51 AM
#23
For those of you who are following the project, check the buildlog - I updated it.

That being said, we could still use more contributors, in any area from hardware and software to financial/logistics concerns.

There's an amazing amount of work between idea and implementation, especially when production is considered.

B.H.
newbie
Activity: 53
Merit: 0
November 13, 2013, 02:33:56 AM
#22
Couldn't help myself, had to mock up a 19" rack panel for fun.

Let me know what you guys think. I'm probably going to design the chassis around a blue-glow theme. The six slots are for 7-segment readouts, the I/O connectors are panel-mount Neutriks, and we have a big red emergency start button for fun.

This panel is designed for the case where I decide to water-cool the unit. The radiator and fans are moved up front for heat in the winter, and because the front of the rig is usually the least impeded (if it's on a desk or in a rack, the rear plate will be up against a wall or enclosure panel; we want efficiency - since when have you put a computer against the wall facing the wrong way?).

Again, let me know what you guys think.

The upload is of bad quality but I think it looks pretty snazzy.

http://i1272.photobucket.com/albums/y384/blasthash/CellBE_Supercomputer-1_zps2c89570b.jpg

It's a start. Maybe have the option of liquid cooling it for silent operation, but that's just a luxury.

True. I'm probably going to build the first R&D prototype with a liquid cooler for efficiency, although production, when we get to that point, will most likely have it as an option.
hero member
Activity: 518
Merit: 500
November 13, 2013, 02:27:34 AM
#21
Couldn't help myself, had to mock up a 19" rack panel for fun.

Let me know what you guys think. I'm probably going to design the chassis around a blue-glow theme. The six slots are for 7-segment readouts, the I/O connectors are panel-mount Neutriks, and we have a big red emergency start button for fun.

This panel is designed for the case where I decide to water-cool the unit. The radiator and fans are moved up front for heat in the winter, and because the front of the rig is usually the least impeded (if it's on a desk or in a rack, the rear plate will be up against a wall or enclosure panel; we want efficiency - since when have you put a computer against the wall facing the wrong way?).

Again, let me know what you guys think.

The upload is of bad quality but I think it looks pretty snazzy.



It's a start. Maybe have the option of liquid cooling it for silent operation, but that's just a luxury.
newbie
Activity: 53
Merit: 0
November 13, 2013, 01:24:21 AM
#20
Couldn't help myself, had to mock up a 19" rack panel for fun.

Let me know what you guys think. I'm probably going to design the chassis around a blue-glow theme. The six slots are for 7-segment readouts, the I/O connectors are panel-mount Neutriks, and we have a big red emergency start button for fun.

This panel is designed for the case where I decide to water-cool the unit. The radiator and fans are moved up front for heat in the winter, and because the front of the rig is usually the least impeded (if it's on a desk or in a rack, the rear plate will be up against a wall or enclosure panel; we want efficiency - since when have you put a computer against the wall facing the wrong way?).

Again, let me know what you guys think.

The upload is of bad quality but I think it looks pretty snazzy.

http://i1272.photobucket.com/albums/y384/blasthash/CellBE_Supercomputer-1_zps2c89570b.jpg
newbie
Activity: 53
Merit: 0
November 12, 2013, 11:33:54 PM
#19
Would the PS3 CPU be of any use for scrypt-jane coins? The efficiency of this chip seems remarkable.

Ideally, it'd be usable for any scrypt protocol (and routine SHA-256 as well, albeit at much lower efficiency, in hashes per watt, than GPUs).

The efficiency is what I find impressive. Under full load, the PS3 Slim consumes right around 70 W. If we assume roughly 95% conversion efficiency for the power supply, then about 67 W reaches the board; allowing roughly half of that for the GPU, plus peripherals, puts the Cell's power consumption in most applications below 25 W.
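A back-of-the-envelope version of that budget. The 70 W wall draw and 95% conversion are the figures above; the GPU share and peripheral draw are assumptions made for illustration:

```python
# Back-of-the-envelope power budget for the Cell in a PS3 Slim.
# WALL_DRAW_W and PSU_EFFICIENCY come from the post; GPU_FRACTION and
# PERIPHERALS_W are assumed values, not measurements.
WALL_DRAW_W = 70.0          # PS3 Slim at full load
PSU_EFFICIENCY = 0.95       # assumed supply conversion efficiency
GPU_FRACTION = 0.5          # "GPU at roughly half" of the board draw
PERIPHERALS_W = 10.0        # assumed drives, RAM, Wi-Fi, etc.

board_draw = WALL_DRAW_W * PSU_EFFICIENCY            # ~66.5 W to the board
cell_draw = board_draw * (1 - GPU_FRACTION) - PERIPHERALS_W

print(f"board: {board_draw:.1f} W, Cell estimate: {cell_draw:.2f} W")
```

Anything in that neighborhood lands the Cell comfortably under the 25 W figure.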

If we can carry this project through to production, it would not be infeasible to sell two models of the Superminer: a 15- or 20-core variety and a 40-core menace. I'll have to see some code analytics to pin down per-chip hashrate, but knowing that a while ago someone developed a scrypt miner for a Linux PS3 and managed to eke out 34 kH/s using 6 SPEs, probably without using the vector instructions to their fullest, you can see the headroom such a device could allow for.
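As a rough illustration of that headroom, here is the 34 kH/s / 6-SPE data point scaled linearly. Both the linear scaling and the use of all 8 SPEs per salvaged chip (the PS3 exposes only 6 to software) are optimistic assumptions, so treat these as upper-bound sketches rather than projections:

```python
# Linear scaling of the reported PS3 result (~34 kH/s on 6 SPEs).
# Assumes all 8 SPEs per chip are usable and perfect scaling across
# chips - both optimistic; this is a ceiling sketch, not a forecast.
BASE_KHS = 34.0
BASE_SPES = 6

per_spe = BASE_KHS / BASE_SPES       # ~5.7 kH/s per SPE, unoptimised
for chips in (15, 20, 40):
    print(f"{chips} chips: ~{per_spe * chips * 8:.0f} kH/s")
```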

For the first R&D build I'm going to really showcase this thing - it needs to look the part.

When I get to building, major miner-pr0n will result.
legendary
Activity: 2100
Merit: 1167
MY RED TRUST LEFT BY SCUMBAGS - READ MY SIG
November 12, 2013, 10:40:50 PM
#18
Would the PS3 CPU be of any use for scrypt-jane coins? The efficiency of this chip seems remarkable.
newbie
Activity: 53
Merit: 0
November 12, 2013, 08:59:30 AM
#17
I'll give everyone a bit more information to chew on, then I'm calling it quits for the night; otherwise my brain will be fried from thinking too much about multi-core architectures.

Methods of constructing the Superminer at the macroscopic (inter-processor) level:

1. SC/DC (Smart Computer, Dumb Chips) - a master computer (a standard PC on a Linux distro, etc.) commands FPGAs that 'hot-inject' block data into RAM, essentially force-feeding the processors. Each member of the array depends on the total processor count for its specific slice of the nonce space. The least nasty code, and a UART data push to the FPGAs is simple to achieve from a high level. The code would have to be altered whenever units are added or removed, unless it grows more complex to auto-detect them.

2. Homogeneous master node - one Cell unit boots a full Linux OS on its PPE core and delegates tasks to its SPEs and the other processor units. Nastier code, but the array does not depend on the number of processors present - that is, the code is the same whether 2 or 10 CPUs are attached.

3. "Democratic" node setup - all nodes (CPUs) arrayed in a headless server configuration. This takes the least work and the least inter-processor communication, at the expense of being the closest in complexity to a bunch of PS3s mounted on custom PCBs.

Personally, while I endorse the first option as the 'easiest' method to implement without being a cheap shot, the second quite possibly allows for the most optimized results. The only issue is that we would need more information on how inter-processor communication functions.

This is where we need more help. The southbridge for the Cell processor has PCI endpoints, but we need to know how it communicates and reads over that bus. This is a definite candidate for inter-processor comms, and it spares the processor chain from having to package and publish the data over a LAN connection as it would for GbE communication. If any of you know how such a communication scheme would translate into code, you'd be much wanted in the project circle.

Also, if you check the second post (the project information/status post), I've added a GitHub repo link as well as a LTC address that is earmarked for project research. Anything helps, and even if you can't contribute knowledge, you'll still be able to know you helped get the Superminer on two legs.

The first option sounds the most doable in the short term. The second is more streamlined than the first. The last one would be the ideal one in the long run.

My thoughts exactly. It stands to reason that if I get creative with the implementation, the hardware would need minimal changeover between the various iterations. Currently the FPGA I'm considering is a Spartan-6 LXT150, which would have enough I/O to feed 6 to 8 different cores with enough data.

What I was originally thinking was flashing each processor's BIOS/boot flash chip (on the principle that the boot flash sits at address 0x0, the first address) with the assembly code for the operation, with each processor's nonce-space share hard-coded into the assembly. While this is a simple route, it is the most expensive in terms of upgrades - every processor would have to be reflashed if the system configuration changed. If we get smart with the FPGA implementation and the code on the CPUs, though, we could have the FPGA 'prime' each processor with its share prior to work.
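A hypothetical sketch of that 'priming' step: the host (or FPGA controller) computes each processor's contiguous slice of the 32-bit nonce space, so nothing has to be baked into the flash image. The helper name and interface here are illustrative only:

```python
# Hypothetical helper: split the 32-bit nonce space into contiguous
# (start, end) shares, one per processor, so the controller can prime
# each chip with its range instead of reflashing on config changes.
NONCE_MAX = 0xFFFFFFFF

def nonce_shares(n_procs):
    span = (NONCE_MAX + 1) // n_procs
    shares = []
    for i in range(n_procs):
        start = i * span
        # The last share absorbs any remainder from the division.
        end = NONCE_MAX if i == n_procs - 1 else start + span - 1
        shares.append((start, end))
    return shares

for start, end in nonce_shares(4):
    print(f"0x{start:08X} - 0x{end:08X}")   # → four 1-GiB-sized ranges
```

Adding or removing a processor then only changes the argument, not the firmware.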

IBM has been nice enough to take the approach that it won't even supply datasheets without a design consultation for custom specs; as that is almost certainly outside this project's budget, it isn't being pursued. Right now I have two things on the chopping block:

a) Reconstruct the ballout accurately from leaked Sony service manuals;

b) You'll also notice, if you take a look, that there is a separate I/O channel. This is almost certainly MMIO, but without a datasheet we don't know what address each pin sits at. Until that can be tested, the only way I can think of doing this is to 'hotwire' the communications FPGA into the RAM addressing. This comes at the cost of RAM real estate, but since store and load instructions are high-latency anyway, ideally this processing would rely primarily on the caches in the first place.
hero member
Activity: 518
Merit: 500
November 12, 2013, 08:38:33 AM
#16
I'll give everyone a bit more information to chew on, then I'm calling it quits for the night; otherwise my brain will be fried from thinking too much about multi-core architectures.

Methods of constructing the Superminer at the macroscopic (inter-processor) level:

1. SC/DC (Smart Computer, Dumb Chips) - a master computer (a standard PC on a Linux distro, etc.) commands FPGAs that 'hot-inject' block data into RAM, essentially force-feeding the processors. Each member of the array depends on the total processor count for its specific slice of the nonce space. The least nasty code, and a UART data push to the FPGAs is simple to achieve from a high level. The code would have to be altered whenever units are added or removed, unless it grows more complex to auto-detect them.

2. Homogeneous master node - one Cell unit boots a full Linux OS on its PPE core and delegates tasks to its SPEs and the other processor units. Nastier code, but the array does not depend on the number of processors present - that is, the code is the same whether 2 or 10 CPUs are attached.

3. "Democratic" node setup - all nodes (CPUs) arrayed in a headless server configuration. This takes the least work and the least inter-processor communication, at the expense of being the closest in complexity to a bunch of PS3s mounted on custom PCBs.

Personally, while I endorse the first option as the 'easiest' method to implement without being a cheap shot, the second quite possibly allows for the most optimized results. The only issue is that we would need more information on how inter-processor communication functions.

This is where we need more help. The southbridge for the Cell processor has PCI endpoints, but we need to know how it communicates and reads over that bus. This is a definite candidate for inter-processor comms, and it spares the processor chain from having to package and publish the data over a LAN connection as it would for GbE communication. If any of you know how such a communication scheme would translate into code, you'd be much wanted in the project circle.

Also, if you check the second post (the project information/status post), I've added a GitHub repo link as well as a LTC address that is earmarked for project research. Anything helps, and even if you can't contribute knowledge, you'll still be able to know you helped get the Superminer on two legs.

The first option sounds the most doable in the short term. The second is more streamlined than the first. The last one would be the ideal one in the long run.
newbie
Activity: 53
Merit: 0
November 12, 2013, 08:33:30 AM
#15
The theory is sound. I'll keep an eye on this project. How much does the Cell processor go for, per unit or in bulk?

That's one of the principal issues - the source of units themselves.

Right now the only practical route is to buy busted PS3s in the $50 range and reflow the Cells off the boards. Bear in mind that a quad-core i7 runs about $150 in its laptop variety, so even with nothing else recovered, it's not a bad deal. Test and resell the PSUs, drives, and Wi-Fi/BT modules, and with any luck the processor comes out effectively free.

Problem is, a malfunctioning PS3 doesn't exactly have a steady price, but for R&D purposes the second-hand route works fine. If this project can produce a functioning prototype, I may start a crowd-funded Kickstarter to finance a small contract with IBM for the chips - they're keeping these things locked down TIGHT.

A slight bonus, once I get the reflow chip salvage up and running: I'll sell forum members the mobos and some of the RSXs (essentially an Nvidia GPU) for experimenting, parts replacement, or wall decoration - and for cheap.

True, the price of a broken PS3 varies with each seller's sense of its value. On Craigslist I've seen them as low as $15 and as high as $100, but on average between $20 and $50.

Exactly. I've already snagged one for $50 or so on eBay. I'm going to see if I can fix it and get it working as a standalone miner in the meantime; if not, it's pure R&D and a chip snag.
newbie
Activity: 53
Merit: 0
November 12, 2013, 08:32:24 AM
#14
I'll give everyone a bit more information to chew on, then I'm calling it quits for the night, otherwise my brain will be fried from thinking too much over multi-core architectures  Huh

Methods of constructing the Superminer at the macroscopic (inter-processor) level:

1: SC/DC (Smart Computer, Dumb Chips) - Master computer (standard PC on Linux distro, etc.) commands FPGAs which 'hot-inject' block data into RAM, essentially 'force-feeding' the processors. Each member of array dependent on total number of processors for what its specific task in the noncespace should be. Least nasty code, and UART data push to FPGAs simple to achieve in high level. Code would have to be altered if units were to be added or subtracted, lest the code increase in complexity for auto-detection.

2. Homogenous master node - One Cell unit boots a full Linux OS on its PPE core, and delegates its SPEs and other processor units as to what task to carry out. More nasty code, but array is not dependent on the number of processors present - that is, the code will be the same for 2 or 10 CPUs present.

3. "Democratic" node setup - All nodes (CPUs) arrayed in a headless server configuration. This takes the least work, and the least amount of inter-processor communication, at the expense of being the closest in complexity to a bunch of PS3s mounted on custom PCBs.

Personally, while I endorse the first one the most as I feel it is the 'easiest' method to implement without being a cheap-shot, the second method allows for quite possibly the most optimized results. The only issue here is that we would have to have more information on how interprocessor communication functions.

This is where we need more help. The southbridge for the Cell processor has PCI endpoints, but we need to know how it communicates and reads over that bus. This is a definite candidate for IP comm, and prevents the processor chain from having to package and publicize the data over a LAN connection like it would for GbE communication. If any of you guys know how such a communication scheme would work into code, you'd be much wanted in the project circle.

Also, if you check the second post (the project information/status post), I've added a GitHub repo link as well as a LTC address that is earmarked for project research. Anything helps, and even if you can't contribute knowledge, you'll still be able to know you helped get the Superminer on two legs.
hero member
Activity: 518
Merit: 500
November 12, 2013, 08:22:08 AM
#13
The theory is sound. I'll keep an eye on this project. How much does the Cell processor go for, per unit or in bulk?

That's one of the principal issues - the source of units themselves.

Right now the only practical route is to buy busted PS3s in the $50 range and reflow the Cells off the boards. Bear in mind that a quad-core i7 runs about $150 in its laptop variety, so even with nothing else recovered, it's not a bad deal. Test and resell the PSUs, drives, and Wi-Fi/BT modules, and with any luck the processor comes out effectively free.

Problem is, a malfunctioning PS3 doesn't exactly have a steady price, but for R&D purposes the second-hand route works fine. If this project can produce a functioning prototype, I may start a crowd-funded Kickstarter to finance a small contract with IBM for the chips - they're keeping these things locked down TIGHT.

A slight bonus, once I get the reflow chip salvage up and running: I'll sell forum members the mobos and some of the RSXs (essentially an Nvidia GPU) for experimenting, parts replacement, or wall decoration - and for cheap.

True, the price of a broken PS3 varies with each seller's sense of its value. On Craigslist I've seen them as low as $15 and as high as $100, but on average between $20 and $50.
newbie
Activity: 53
Merit: 0
November 12, 2013, 07:37:27 AM
#12
The theory is sound. I'll keep an eye on this project. How much does the Cell processor go for, per unit or in bulk?

That's one of the principal issues - the source of units themselves.

Right now the only practical route is to buy busted PS3s in the $50 range and reflow the Cells off the boards. Bear in mind that a quad-core i7 runs about $150 in its laptop variety, so even with nothing else recovered, it's not a bad deal. Test and resell the PSUs, drives, and Wi-Fi/BT modules, and with any luck the processor comes out effectively free.

Problem is, a malfunctioning PS3 doesn't exactly have a steady price, but for R&D purposes the second-hand route works fine. If this project can produce a functioning prototype, I may start a crowd-funded Kickstarter to finance a small contract with IBM for the chips - they're keeping these things locked down TIGHT.

A slight bonus, once I get the reflow chip salvage up and running: I'll sell forum members the mobos and some of the RSXs (essentially an Nvidia GPU) for experimenting, parts replacement, or wall decoration - and for cheap.
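For a sense of the salvage economics: only the ~$50 unit price comes from the discussion above; every resale figure below is a placeholder I've assumed for illustration:

```python
# Rough salvage economics per busted PS3. The $50 unit price is from
# the post; every resale value below is a placeholder assumption.
PS3_COST = 50.0
RESALE_ESTIMATES = {        # hypothetical tested-part resale values
    "PSU": 15.0,
    "drive": 15.0,
    "wifi_bt_module": 10.0,
    "misc_parts": 10.0,
}

net_cell_cost = PS3_COST - sum(RESALE_ESTIMATES.values())
print(f"effective cost per Cell: ${net_cell_cost:.2f}")
```

Under these assumptions the parts cover the purchase, which is the "processor might be valued at free" scenario.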
hero member
Activity: 518
Merit: 500
November 12, 2013, 06:05:00 AM
#11
The theory is sound. I'll keep an eye on this project. How much does the Cell processor go for, per unit or in bulk?
hero member
Activity: 798
Merit: 1000
‘Try to be nice’
November 12, 2013, 05:55:33 AM
#10
Sounds interesting, keep us posted. I love this shit.

Don't know if I can help, I need to find out. Looks like a hella fun project though.
newbie
Activity: 53
Merit: 0
November 12, 2013, 03:49:10 AM
#9
For anyone who's keeping tabs on this project, I've changed the name of the thread topic to better reflect what the purpose of this is.
newbie
Activity: 53
Merit: 0
November 11, 2013, 02:50:12 AM
#8
Interesting.

However, the main bottleneck of scrypt is RAM access. If the CPU you choose has only a small amount of cache memory, then you are bound by RAM throughput anyway. Let's assume you use DDR3-2133 SDRAM. Then the throughput is 2133×10^6 × (data bus width in bytes) bytes/sec. Scrypt requires 128 KB of data to be transferred back and forth per calculated hash. That means that with a 64-bit bus you may achieve 2133×10^6 × 8 / (128×1024×2) = 65,093 hashes a second; if the CPU has a 32-bit data bus, divide this number by 2. This is a theoretical maximum. Actual performance would be lower: to perform the necessary calculations in parallel with these memory reads and writes, the CPU would have to execute 90,651,115,520 instructions on 32-bit words per second to meet those 65,093 hashes (absent any provision for parallel computation), and this estimate does not include the necessary SHA calculations.

Now, let's look at a powerful CPU with several cores and an internal cache big enough to keep 128 KB of data per core without touching external RAM. For example, an Intel Core 2 Quad Q9550 manages 32.2 kH/s (source: https://litecoin.info/Mining_hardware_comparison#Intel).

Thoughts?
First of all, just to point out: we're talking about one specific CPU here.

The Cell has enough local store (256 KB per SPE) to hold the data on each SPU, with external RAM present for load/store instructions as a safety buffer. As planned at present, an FPGA will 'inject' the block and the other data to be hashed into RAM at a specific address. (DRAM contents do decay without refresh, but on a time frame we won't be worried about.) The FPGA will then trigger an interrupt that lets the CPU/SPE core break out of a branching loop and start reading code off that address. With the vector units we have access to the full bore of 128 general-purpose registers, each 128 bits wide. The way the code looks now, even in a highly brutish, non-optimized state, we will not need expensive memory stores or loads aside from the start and finish.

Obviously the FPGA will be running much slower than the SPE cores, so it won't be able to service pickups from RAM as fast as the CPU can issue them. But all things considered, I think it's still a highly optimizable process.
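The inject-then-interrupt handshake can be sketched in software. This is only a stand-in for the hardware mechanism: a threading event plays the role of the interrupt line, and a dict stands in for the injected RAM region:

```python
import threading

# Software stand-in for the FPGA -> CPU handshake described above:
# the "FPGA" writes block data to a shared buffer, then raises a flag
# (playing the role of the interrupt) that releases the worker from
# its wait loop.
work_ram = {}                      # stands in for the injected RAM region
work_ready = threading.Event()     # stands in for the interrupt line

def spe_worker(results):
    work_ready.wait()              # "branch loop" until the interrupt fires
    results.append(work_ram["block"])

def fpga_inject(block):
    work_ram["block"] = block      # hot-inject the data first...
    work_ready.set()               # ...then trigger the interrupt

results = []
t = threading.Thread(target=spe_worker, args=(results,))
t.start()
fpga_inject(b"block header bytes")
t.join()
print(results)                     # → [b'block header bytes']
```

The ordering matters in hardware too: the data must land in RAM before the interrupt fires, or the core wakes to a stale buffer.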
hero member
Activity: 574
Merit: 523
November 10, 2013, 10:24:37 PM
#7
Interesting.

However, the main bottleneck of scrypt is RAM access. If the CPU you choose has only a small amount of cache memory, then you are bound by RAM throughput anyway. Let's assume you use DDR3-2133 SDRAM. Then the throughput is 2133×10^6 × (data bus width in bytes) bytes/sec. Scrypt requires 128 KB of data to be transferred back and forth per calculated hash. That means that with a 64-bit bus you may achieve 2133×10^6 × 8 / (128×1024×2) = 65,093 hashes a second; if the CPU has a 32-bit data bus, divide this number by 2. This is a theoretical maximum. Actual performance would be lower: to perform the necessary calculations in parallel with these memory reads and writes, the CPU would have to execute 90,651,115,520 instructions on 32-bit words per second to meet those 65,093 hashes (absent any provision for parallel computation), and this estimate does not include the necessary SHA calculations.

Now, let's look at a powerful CPU with several cores and an internal cache big enough to keep 128 KB of data per core without touching external RAM. For example, an Intel Core 2 Quad Q9550 manages 32.2 kH/s (source: https://litecoin.info/Mining_hardware_comparison#Intel).

Thoughts?
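The bandwidth ceiling quoted above can be reproduced in a few lines; the constants are exactly those stated in the post:

```python
# Reproducing the DDR3-2133 bandwidth ceiling for scrypt: 128 KiB
# written out and read back per hash over a 64-bit bus.
TRANSFERS_PER_SEC = 2133 * 10**6
BUS_BYTES = 8                      # 64-bit data bus
BYTES_PER_HASH = 128 * 1024 * 2    # scratchpad written, then read back

max_hashes = TRANSFERS_PER_SEC * BUS_BYTES // BYTES_PER_HASH
print(max_hashes)                  # → 65093
```

Halving the bus width to 32 bits halves the ceiling, as the post notes.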
newbie
Activity: 53
Merit: 0
November 10, 2013, 10:03:42 PM
#6
The CPU in the PS4 (and Xbox One) is an 8-core "classic x86" CPU made by AMD. Using the Cell processor in the PS3 made porting games very difficult, and third-party games are much more important. That often meant that, due to the Cell (which more than one game studio blasted as a nightmare to work on), PS3 ports would come much later.

http://www.anandtech.com/show/6976/amds-jaguar-architecture-the-cpu-powering-xbox-one-playstation-4-kabini-temash

Yeah. The Cell is admittedly a monster to try and code for, mainly because the cores are heterogeneous: one is a PowerPC engine (PPE), and eight are SPEs that use a completely different assembly language and end-compiler. It's hard to get the code to cooperate, at least from a highly-integrated software standpoint (i.e. games).
donator
Activity: 1218
Merit: 1079
Gerald Davis
November 10, 2013, 10:00:40 PM
#5
The CPU in the PS4 (and Xbox One) is an 8-core "classic x86" CPU made by AMD. Using the Cell processor in the PS3 made porting games very difficult, and third-party games are much more important. That often meant that, due to the Cell (which more than one game studio blasted as a nightmare to work on), PS3 ports would come much later.

http://www.anandtech.com/show/6976/amds-jaguar-architecture-the-cpu-powering-xbox-one-playstation-4-kabini-temash
newbie
Activity: 53
Merit: 0
November 10, 2013, 09:49:06 PM
#4
This does sound very interesting, i wasn't aware the ps3 cpu was so remarkable and efficient in terms of energy.  What cpu has the ps4 got?

There's no official info as far as I know, but it seems they're ditching the Cell/B.E. - which is an utter waste, because its development cost over $100M between Sony, IBM, and Toshiba.

They may be working on a GPU-inspired CPU. Nvidia has really been pushing the general-purpose processing power of its CUDA framework lately; it may stand that a CPU expands on that concept.

And yes, the entire unit expends 10 W or so at nominal frequency during normal operation. Constant vector ops may push that higher, but it stands to reason that a rig could cram in 20 units (180 cores) for under the total dissipation of a desktop PC.

I'm working on getting the Cell SDK from IBM as we speak. Optimization of this code, to get the performance gains I conjectured about in the buildlog, will have to be tight - but it's by all means doable.

To do some basic, approximate math: if a key search were distributed across all values of the nonce, running 20 CPUs, each with an 8-SPU, 64-point vector-op pipe, would give us:

4,294,967,296 / (20 * 8 * 64) ≈ 419,430 iterations, plus some operations for each vector point, to find the result. Even with non-optimized code the array could sweep that in short order, and faster still if the vector processing were extended somewhat onto the PPE core.
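That division can be checked directly; the 20-chip, 8-SPU, 64-lane figures are the conjectured configuration, not measured hardware:

```python
# Sanity check on the nonce-space division: 2^32 nonce values spread
# across 20 chips x 8 SPUs x 64-point vector pipes (conjectured config).
NONCE_SPACE = 2**32
LANES = 20 * 8 * 64              # 10,240 parallel lanes

iters_per_lane = NONCE_SPACE / LANES
print(iters_per_lane)            # → 419430.4
```

So each lane covers roughly 419 thousand nonce values in a full sweep of the space.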

Those are pretty remarkable results. I'll have to wait until I run some compile tests on the actual code to get real start-to-finish timing information, but that's amazing in and of itself.

My idea is to have an FPGA perform memory "injection" - say, a fast Spartan-6 LXT - that injects the block to be run at a specific memory address. The CPUs would ideally sit locked in a branch loop until an interrupt signal is generated, at which point they break out of the loop and start running the code.

It'll take a good while to get to the full 20-processor implementation, but with some R&D funds and resource management it won't be too bad.

Hope this gives you some good information.