I don't see any big breakthrough in the high-performance computing field yet. One piece of evidence is that QLogic bought PathScale for only $109M. Considering that VCs had put in around $60M, this valuation was not regarded as high at all.
Actually, the performance bottleneck lies between the CPU and memory. That is why so many levels of cache are placed in between. For multiprocessors the problem gets even messier, because memory coherence across the processors' caches must also be handled.
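To make the cache point concrete, here is a rough sketch of a toy direct-mapped cache that just counts hits and misses for a list of addresses. Everything in it is my own assumption for illustration (the function name `simulate_direct_mapped`, the 64 lines of 64 bytes, the two access patterns); it does not model any real CPU, and it ignores multiple cache levels and coherence entirely.

```python
def simulate_direct_mapped(addresses, num_lines=64, line_size=64):
    """Count (hits, misses) for a toy direct-mapped cache.

    num_lines and line_size are illustrative assumptions, not the
    parameters of any particular processor.
    """
    tags = [None] * num_lines          # one stored tag per cache line
    hits = misses = 0
    for addr in addresses:
        block = addr // line_size      # which memory block the byte is in
        index = block % num_lines      # which cache line that block maps to
        tag = block // num_lines       # identifies the block within that line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag          # evict the old block, load the new one
    return hits, misses


# Sequential 4-byte reads over 16 KiB: each 64-byte line is loaded once
# and then reused by the next 15 reads.
sequential = [i * 4 for i in range(4096)]

# Reads 4096 bytes apart: every address maps to cache line 0 with a
# different tag, so each read evicts the previous block.
strided = [i * 4096 for i in range(4096)]

if __name__ == "__main__":
    print("sequential:", simulate_direct_mapped(sequential))
    print("strided:   ", simulate_direct_mapped(strided))
```

Both patterns issue the same number of reads, but in this model the sequential one misses only once per cache line while the strided one misses every time. On real hardware each of those misses means waiting on a slower level of the memory hierarchy, which is where the time goes.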
If you want to do something in high-performance CPU design, I would suggest taking a look at the OpenSPARC project from Sun. I guess that once you understand the released RTL code, you will be able to figure out a way to design a high-performance multi-core, multi-threaded system.