AMD Calls New FPU "Flex FP", Defends Dual FMAC Approach

From X-bit Labs: Advanced Micro Devices continues to share secrets about its forthcoming products code-named Bulldozer. Recently the company explained in detail its new floating point unit (FPU), called "Flex FP", which promises to deliver high-performance computing while being very efficient in terms of die size and power consumption.

As is known, Bulldozer processors consist of several so-called modules. Each module has two integer engines as well as one "Flex FP" FPU consisting of two 128-bit FMAC units that share their own dedicated scheduler. This approach differs from a hypothetical 256-bit FPU with matching data paths, which would often be underutilized. Moreover, a unified scheduler for both FP and integer execution units would also be less efficient, according to AMD.

"Each Flex FP has its own scheduler; it does not rely on the integer scheduler to schedule FP commands, nor does it take integer resources to schedule 256-bit executions. This helps to ensure that the FP unit stays full as floating point commands occur. Our competitors’ architectures have had single scheduler for both integer and floating point, which means that both integer and floating point commands are issued by a single shared scheduler vs. having dedicated schedulers for both integer and floating point executions," said John Fruehe, the director of product marketing for server/workstation products at AMD.

Modern 128-bit FPUs can execute four single-precision or two double-precision operations in parallel per cycle. The forthcoming AVX technology allows eight 32-bit or four 64-bit operations to be executed per cycle. However, if a program does not support AVX, then "that flashy new 256-bit FPU only executes in 128-bit mode". This is naturally a blow for the 256-bit FPU of Intel's Sandy Bridge processor.
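
The lane counts above follow from dividing the vector width by the element width. A minimal sketch of that arithmetic is shown here; the function name `lanes` is only an illustrative choice and does not come from AMD or Intel documentation.

```python
def lanes(vector_bits: int, element_bits: int) -> int:
    """Return how many elements of element_bits fit in one vector register.

    Illustrative helper only: it models the arithmetic in the article,
    not any real hardware interface.
    """
    if vector_bits % element_bits != 0:
        raise ValueError("vector width must be a multiple of element width")
    return vector_bits // element_bits


# 128-bit FPU: single precision (32-bit) and double precision (64-bit)
print(lanes(128, 32), lanes(128, 64))
# 256-bit AVX: single precision (32-bit) and double precision (64-bit)
print(lanes(256, 32), lanes(256, 64))
```

Code that is not compiled for AVX stays at the 128-bit figures, regardless of how wide the hardware is.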

The beauty of the Flex FP is that it is a single 256-bit FPU that is shared by two integer cores. With each cycle, either core can operate on 256 bits of parallel data via two 128-bit instructions or one 256-bit instruction, or each of the integer cores can execute 128-bit commands simultaneously.
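
The sharing modes described above can be sketched as a toy capacity check. This is a simplified model built only from the article's description (two 128-bit FMAC units per module); the names `FMAC_UNITS`, `FMAC_WIDTH_BITS`, and `fits_in_one_cycle` are hypothetical, and the real scheduler is certainly more involved.

```python
FMAC_UNITS = 2          # per module, as described in the article
FMAC_WIDTH_BITS = 128   # width of each FMAC unit


def fits_in_one_cycle(requests):
    """Check whether a set of FP requests fits in one cycle of one module.

    requests is a list of (core_id, width_bits) pairs, where width_bits is
    128 or 256. Toy model: a 128-bit request occupies one FMAC unit and a
    256-bit request occupies both. core_id is carried only for readability.
    """
    units_needed = 0
    for _core_id, width_bits in requests:
        if width_bits not in (128, 256):
            raise ValueError("only 128-bit and 256-bit requests are modeled")
        units_needed += width_bits // FMAC_WIDTH_BITS
    return units_needed <= FMAC_UNITS


# One core issues a single 256-bit instruction.
print(fits_in_one_cycle([(0, 256)]))
# Each core issues one 128-bit instruction simultaneously.
print(fits_in_one_cycle([(0, 128), (1, 128)]))
# A 256-bit and a 128-bit request together exceed the two FMAC units.
print(fits_in_one_cycle([(0, 256), (1, 128)]))
```

The model ignores latency, dependencies, and scheduler queue depth; it only captures the per-cycle width budget the article describes.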
