Looks like Intel is going with chiplet designs for it's upcoming gpu's.
Lest anyone doubt that the 4-tile GPU doesn't actually exist and is merely a publicity stunt, Raja whipped out the large package and briefly flashed it at the camera during his Hot Chips presentation. And yes, it's really big — much bigger than any other chip package we've seen.
Xe HP only uses EMIB to scale to multi-tile configurations. Xe HPC will also include a Rambo Cache tile, Foveros die stacking, and Co-EMIB with additional enhancements. Ponte Vecchio is planned for use in the upcoming Aurora supercomputer, and it was supposed to be manufactured on Intel's now-delayed 7nm lithography.
In the meantime, Intel now has 1-tile, 2-tile, and 4-tile Xe HP silicon in its labs. As you'd expect, the EMIB linking means the packages for the latter two are basically 2x and 4x the size of the base design, so the GPUs require three separate sockets.
The 4-tile implementation of Xe HP Raja showed off is capable of around 42 TFLOPS of FP32 compute. However, that's not actually the maximum capability. Raja also mentioned that the 4-tile chip is capable of reaching "petaflops scale computing," or >1000 TFLOPS. That's thanks to the presence of tensor cores, though we don't know the exact configuration.
Like
Nvidia's A100 architecture and Google's TPUv4, Xe HP supports tensor cores. We assume these are capable of 128 operations per cycle, with one tensor core per EU. With 2048 EUs, that gives us:
2048 × 128 × 2 (FMA) = 524,288
We're missing clock speed, which would suggest either a 2GHz baseline for one petaflop, or potentially a different tensor core arrangement that can do more than 128 ops per clock. Either way, it should make it much easier for supercomputers to reach the level of exascale computing.
https://www.tomshardware.com/news/raja-koduri-petaflops-scale-4-tile-xe-hp-gpu-hot-chips