AMD Excavator Core May Bring Dramatic Performance Increases

From X-bit Labs: As Advanced Micro Devices prepares to launch its next-generation microprocessors with Steamroller high-performance x86 cores, enthusiasts are uncovering details about its fourth-generation Bulldozer core, code-named Excavator. It appears that the processing engine will support 256-bit AVX2 instructions, which may mean that it will feature rather revolutionary changes from existing Bulldozer cores.

AMD recently submitted a patch to the GCC community that enables support for its future high-performance micro-architecture code-named Excavator, which the chip developer calls "bdver4" internally. The initial patch is designed to bring only very general Excavator support to the compiler, which is widely used on the Linux operating system, but even that general support may reveal some of Excavator's secrets. Based on the information released by AMD, Excavator will support all the instructions found in Intel's current microprocessors code-named Haswell, including SSE4.1, SSE4.2, AES, PCLMUL, AVX, BMI, F16C, MOVBE, AVX2, BMI2, RDRND and so on.
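As a rough illustration, the extensions listed above can be checked against the feature flags that Linux reports in /proc/cpuinfo. The sketch below is not part of AMD's patch: the helper names are hypothetical, and the mapping from the names used in this article to the kernel's flag spellings is an assumption of the example.

```python
# Mapping from the extension names used in the article to the flag
# spellings found in Linux's /proc/cpuinfo. This mapping is an
# assumption of the sketch, not something stated in AMD's patch.
ARTICLE_TO_CPUINFO = {
    "SSE4.1": "sse4_1",
    "SSE4.2": "sse4_2",
    "AES": "aes",
    "PCLMUL": "pclmulqdq",
    "AVX": "avx",
    "BMI": "bmi1",
    "F16C": "f16c",
    "MOVBE": "movbe",
    "AVX2": "avx2",
    "BMI2": "bmi2",
    "RDRND": "rdrand",
}


def parse_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_extensions(flags):
    """List the article's extension names whose cpuinfo flag is absent."""
    return [name for name, flag in ARTICLE_TO_CPUINFO.items() if flag not in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            flags = parse_flags(f.read())
    except OSError:
        flags = set()  # not Linux, or the file is unreadable
    print("missing:", missing_extensions(flags))
```

On a processor that implements everything in the list, the script should report an empty list. On non-x86 systems or where /proc/cpuinfo is unavailable, it simply reports every extension as missing.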

The most important disclosure is support for the AVX2 instructions that Intel introduced with Haswell earlier this year. While such instructions are barely used today and may therefore not seem important yet, they require major hardware changes from previous generations, which clearly matters for the future.

The original AVX brought 256-bit floating-point SIMD instructions, while AVX2 makes it possible to use the 256-bit wide YMM registers for integer data types as well. The problem with current AMD hardware is that the Bulldozer FPU supports only 128-bit integer operations, as used in the XOP instruction set, reports the HardwareLuxx web-site.
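To make the width difference concrete, the sketch below models a packed 32-bit integer add in plain Python: a 128-bit register holds four such lanes, a 256-bit YMM register holds eight. It only imitates the lane-wise, wrap-around behaviour of such an instruction. The function names are hypothetical, and the two-halves variant is shown purely to illustrate the arithmetic, not as a description of how any AMD design works.

```python
LANE_BITS = 32
MASK = (1 << LANE_BITS) - 1  # largest unsigned 32-bit value


def lanes(register_bits):
    """Number of 32-bit lanes that fit in a register of the given width."""
    return register_bits // LANE_BITS


def packed_add(a, b):
    """Lane-wise add of two equal-length lists of 32-bit lanes, wrapping on overflow."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return [(x + y) & MASK for x, y in zip(a, b)]


def add_256_via_128(a, b):
    """The same 8-lane add, done as two separate 4-lane (128-bit) halves."""
    half = lanes(128)
    return packed_add(a[:half], b[:half]) + packed_add(a[half:], b[half:])


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, MASK]
    b = [10] * lanes(256)
    print(lanes(128), "lanes in 128 bits,", lanes(256), "lanes in 256 bits")
    print(packed_add(a, b))       # one 256-bit-wide operation
    print(add_256_via_128(a, b))  # two 128-bit-wide operations, same result
```

Both calls produce the same eight results. The difference is that a 128-bit datapath needs two operations where a 256-bit one needs a single operation.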

To support AVX2 instructions, AMD will need to either considerably upgrade its FPU [floating point unit], which is shared between the two integer cores in a Bulldozer module, or even develop a new one from scratch. A new unit would be expected to deliver dramatic performance improvements, but even a redesigned one should be noticeably faster than the existing one in numerous demanding applications that process large amounts of data.
