Wednesday, 1 August 2007

Intel presents Polaris, its 80-core chip


Meet Polaris - it's the North Star, y'know
By Charlie Demerjian: Sunday 11 February 2007, 19:52
THE ROADMAP to high end chips is now more than ever dominated by interconnects and the ability to get data in, out and around the chip.
Couple that with a trend toward more task specific CPUs and you have a new "paradigm" in the works. Those paradigms are shown off in Intel's Polaris chip.
Polaris was the 80 core CPU shown off at the last IDF as a demo for teraflop computing on a chip. To put this in perspective, an ACM article (1) estimated in 1988 that a teraflop machine would need 100 megawatts, though chips like the 68882 could drop that by a notable amount. The authors theorised that five megawatts, cooling included, was possible with some advances in tech.
The first teraflop machine that was actually built was ASCI Red at Sandia National Lab. It was 104 cabinets housing 10,000 Pentium Pros, spread out over 2500 square feet. It consumed a mere 500 kW, yay progress. Polaris does this 10 years later in 275 square mm and consumes 62W when doing so.
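To put rough numbers on that progress, here is a minimal sketch that only divides the figures quoted above. The constant and function names are made up for illustration, and comparing a machine room's floor space with a die's area is the article's own loose comparison, not a rigorous one.

```python
# Figures quoted in the article.
ASCI_RED_POWER_W = 500_000   # 500 kW
ASCI_RED_AREA_SQFT = 2500    # floor space
POLARIS_POWER_W = 62
POLARIS_AREA_MM2 = 275       # die area

# 1 ft is exactly 304.8 mm, so 1 square foot is 304.8**2 square mm.
MM2_PER_SQFT = 304.8 ** 2


def power_ratio():
    """How many times more power ASCI Red drew than Polaris."""
    return ASCI_RED_POWER_W / POLARIS_POWER_W


def area_ratio():
    """How many times more area ASCI Red covered than the Polaris die."""
    return ASCI_RED_AREA_SQFT * MM2_PER_SQFT / POLARIS_AREA_MM2


if __name__ == "__main__":
    print(f"power: about {power_ratio():,.0f}x less")
    print(f"area:  about {area_ratio():,.0f}x less")
```

That works out to roughly 8,000 times less power and several hundred thousand times less area for the same headline teraflop.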

Polaris is made up of tiles, identical tiles, 80 of them in an 8 * 10 arrangement. Each tile does not do very much; this is a test chip, not a general purpose CPU. Each tile has two FP engines, data and instruction memory and a router. The main point of this chip is the router, which is there to test mesh interconnects.





When you have a chip capable of more than a teraflop, you need a way to get the data that feeds it on and off the chip. The router is a 6-port unit that will shuffle a mere 80GB/s around with a 1.25ns latency. If you consider that the chip has 80 of these, it can send a lot of bits to and fro, and that is the point of Polaris. At 3.16 GHz, the bisection bandwidth of the chip is 1.62 Tbps. The router can send data to its neighbors in each of the four directions as well as up to the stacked memory that Intel won't talk about yet. The last link goes to the core itself.
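The port layout described above can be sketched in a few lines. This is an illustration under assumptions, not Intel's design: the port names, the (row, col) coordinates, the choice of 8 rows by 10 columns, and the idea that edge tiles simply have unused neighbor ports (no wraparound) are all hypothetical. The aggregate figure is just 80 routers times 80GB/s, an upper bound on switching capacity rather than a measured throughput.

```python
ROWS, COLS = 8, 10            # 8 * 10 tiles; which axis is which is an assumption
ROUTER_BANDWIDTH_GB_S = 80    # per router, as quoted in the article


def router_ports(row, col, rows=ROWS, cols=COLS):
    """Return the six ports of the router at (row, col).

    Neighbor ports map to the adjacent tile's coordinates, or None at the
    edge of the mesh (assumed: no wraparound links).
    """
    candidates = {
        "north": (row - 1, col),
        "south": (row + 1, col),
        "west": (row, col - 1),
        "east": (row, col + 1),
    }
    ports = {}
    for name, (r, c) in candidates.items():
        inside = 0 <= r < rows and 0 <= c < cols
        ports[name] = (r, c) if inside else None
    ports["memory"] = "stacked memory"   # the link Intel won't talk about yet
    ports["core"] = "local core"         # the last link
    return ports


def aggregate_router_bandwidth_gb_s(tiles=ROWS * COLS):
    """Sum of all router capacities, in GB/s (an upper bound only)."""
    return tiles * ROUTER_BANDWIDTH_GB_S


if __name__ == "__main__":
    print(router_ports(0, 0))
    print(aggregate_router_bandwidth_gb_s(), "GB/s across all routers")
```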

The routing algorithm used isn't all that complex; it is just a simple wormhole setup. You make a path between routers, send the data down, and close the link. It is a virtual pipe. This simplicity is one of the ways they can get the latency so low.
The main point of Polaris is just that, to route data around, but there are other things tested here: power savings and potentially disparate cores. Power savings are nothing new, but when you have a router on the core that needs low latency, it can get tricky. In the sleep modes, the FP units can save 90% of peak power and memory can cut back over 50%, but the router can only drop 10%. Latency does not play well with sleep modes.
That brings us to the future, as in why should we care about a test chip that can hit a somewhat arbitrary number of calculations? The answer is twofold: this is two of the next big things for Intel on one chip, and it will be three as soon as they talk about the stacked memory.
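Why a wormhole setup keeps latency low can be shown with a toy model. The sketch below assumes dimension-ordered (X first, then Y) path selection and one cycle per hop per flit. Neither is stated in the article, and the function names are invented for illustration. The point is only that once the head of a packet has opened the path, the rest of the packet streams behind it, instead of each router waiting for the whole packet before passing it on.

```python
def xy_route(src, dst):
    """Tiles visited from src to dst, moving along X first, then Y.

    XY ordering is an assumption; the article only says "wormhole".
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def wormhole_cycles(hops, flits):
    """Head flit takes `hops` cycles; the other flits follow one per cycle."""
    return hops + flits - 1


def store_and_forward_cycles(hops, flits):
    """Each router receives the whole packet before forwarding it."""
    return hops * flits


if __name__ == "__main__":
    path = xy_route((0, 0), (2, 1))
    hops = len(path) - 1
    print("path:", path)
    print("wormhole:", wormhole_cycles(hops, flits=4), "cycles")
    print("store and forward:", store_and_forward_cycles(hops, flits=4), "cycles")
```

The cycle counts are abstract units, not the 1.25ns figure quoted for the real router.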
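How much a whole tile saves in sleep mode depends on how its peak power is split between the blocks, which the article does not give. The sketch below uses the three quoted percentages with an invented equal split, purely to show how the stubborn router drags down the total. The split and the names are hypothetical.

```python
# Fraction of peak power saved in sleep mode, per block, as quoted above.
# Memory is "over 50%", so 0.50 is a lower bound.
SLEEP_SAVINGS = {"fp": 0.90, "memory": 0.50, "router": 0.10}


def sleeping_power(peak_by_block):
    """Power left when every listed block is asleep, in the input's units."""
    return sum(
        peak * (1.0 - SLEEP_SAVINGS[block])
        for block, peak in peak_by_block.items()
    )


if __name__ == "__main__":
    # HYPOTHETICAL split: one unit of peak power per block, not Intel's data.
    example_peak = {"fp": 1.0, "memory": 1.0, "router": 1.0}
    total_peak = sum(example_peak.values())
    asleep = sleeping_power(example_peak)
    print(f"asleep: {asleep:.2f} of {total_peak:.2f} units")
    print(f"router share while asleep: {0.9 / asleep:.0%}")
```

With that made-up split, the router accounts for well over half of what the sleeping tile still burns.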
The first is the whole idea of asymmetric cores. If you have a mesh that can shuffle data around willy-nilly, you don't need the same things at all nodes. The nodes are independent of the IO functionality, so as long as they have the right interface and understand the protocols, you can put anything you want on a tile.
Right now, you have two FP units, a couple of chunks of RAM and a little control circuitry. Replace that with a full x86 CPU and you start to see the possibilities. Replace half of them with x86 CPUs, a quarter with GPUs, toss in a physics co-processor and a few other things and you start to see the point.
With a mesh base and a tiled set of chips, you can tailor CPUs to almost any need you want. You can also make the same architecture have 5, 20 or 100 tiles, Celeron, Core Number Numeral and Xeon, all nice and tidy. Easy to design, manufacture and customize.
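As a back-of-the-envelope illustration of that mix-and-match idea, the sketch below fills a chip of any tile count with the mix floated above: half x86, a quarter GPU, one physics co-processor, and the rest lumped together as "other". The tile names, the rounding rule, and the function itself are hypothetical and describe no actual Intel product.

```python
from collections import Counter


def build_layout(n_tiles):
    """Return a list of tile kinds for a chip with n_tiles tiles.

    Half x86, a quarter GPU (both rounded down), one physics
    co-processor if there is room, and everything else as "other".
    """
    n_x86 = n_tiles // 2
    n_gpu = n_tiles // 4
    remaining = n_tiles - n_x86 - n_gpu
    n_physics = 1 if remaining > 0 else 0
    n_other = remaining - n_physics
    return (
        ["x86"] * n_x86
        + ["gpu"] * n_gpu
        + ["physics"] * n_physics
        + ["other"] * n_other
    )


if __name__ == "__main__":
    # The 5, 20 and 100 tile counts are the ones mentioned in the article.
    for n in (5, 20, 100):
        print(n, dict(Counter(build_layout(n))))
```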
The other bit is the mesh itself. Computing has gone from shared buses to point-to-point interconnects like HT. On die, and sometimes off die, you have switches and crossbar interconnects to get the data around. Those devices don't scale all that well, nor do ring buses when you are talking about hundreds of cores instead of a few.
That is where meshes come in: they will take Intel from a cap in the tens of cores to potentially hundreds or thousands. Polaris is about flexibility as well as expandability. It is also a very obvious pointer as to where Intel is going at the end of the decade and beyond.
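The scaling argument can be made concrete with textbook topology formulas. None of these numbers come from Intel, and the function names are made up. A full crossbar needs a crosspoint for every input/output pair, a bidirectional ring's worst-case trip grows linearly with node count, and a 2D mesh's worst-case trip grows only with the sum of its side lengths.

```python
import math


def crossbar_crosspoints(n):
    """A full n x n crossbar has one switch point per input/output pair."""
    return n * n


def ring_diameter(n):
    """Worst-case hop count on a bidirectional ring of n nodes."""
    return n // 2


def mesh_diameter(rows, cols):
    """Worst-case hop count on a rows x cols 2D mesh (corner to corner)."""
    return (rows - 1) + (cols - 1)


if __name__ == "__main__":
    print("Polaris-sized, 80 nodes:")
    print("  crossbar crosspoints:", crossbar_crosspoints(80))
    print("  ring worst case hops:", ring_diameter(80))
    print("  8 * 10 mesh worst case hops:", mesh_diameter(8, 10))

    # Square meshes, for simplicity, at larger node counts.
    for n in (16, 64, 256, 1024):
        side = math.isqrt(n)
        print(n, crossbar_crosspoints(n), ring_diameter(n), mesh_diameter(side, side))
```

At 1,024 nodes the crossbar is past a million crosspoints and the ring's worst case is 512 hops, while the 32 by 32 mesh tops out at 62.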
In the end, Polaris doesn't really do all that much from a functional perspective. It can calculate a teraflop, but that isn't all that useful in the real world. Expect a next gen Polaris to be much more functional in the general sense, followed by things you can buy with meshes. µ
(1) Frey, A. H. and Fox, G. C. "Problems and Approaches for a Teraflop Processor", Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications: Architecture, Software, Computer Systems, and General Issues - Volume 1, 1988
