Intel’s 3rd-generation Xeon Scalable CPUs offer 16-bit FPU processing

Intel today announced its third-generation Xeon Scalable (meaning Gold and Platinum) processors, along with new generations of its Optane persistent memory (read: extremely low-latency, high-endurance SSD) and Stratix AI FPGA products.

The fact that AMD is currently beating Intel on just about every conceivable performance metric except hardware-accelerated AI isn’t news at this point. It’s clearly not news to Intel, either, since the company made no claims whatsoever about Xeon Scalable’s performance versus competing Epyc Rome processors. More interestingly, Intel hardly mentioned general-purpose computing workloads at all.

Finding an explanation of the only non-AI generation-on-generation improvement shown required digging through multiple footnotes. With sufficient determination, we eventually discovered that the “1.9X average performance gain” mentioned on the overview slide refers to “estimated or simulated” SPECrate 2017 benchmarks comparing a four-socket Platinum 8380H system to a five-year-old, four-socket E7-8890 v3.

To be fair, Intel does seem to have introduced some unusually impressive innovations in the AI space. “Deep Learning Boost,” which formerly was just branding for the AVX-512 VNNI instructions, now encompasses an entirely new 16-bit floating point data type as well.

With earlier generations of Xeon Scalable, Intel pushed heavily for 8-bit integer (INT8) inference processing with its OpenVINO library. For inference workloads, Intel argued that the lower precision of INT8 was acceptable in most cases and offered extreme acceleration of the inference pipeline. For training, however, most applications still needed the greater precision of 32-bit floating point (FP32) processing.
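To make the precision trade-off concrete, here is a minimal sketch of symmetric INT8 quantization in Python. It is a generic illustration rather than OpenVINO's actual calibration procedure, and the function names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(values):
    """Map floats onto integers in [-127, 127] using one shared scale.

    Assumption: simple symmetric, per-tensor quantization. Real toolkits
    use more careful calibration than taking the maximum absolute value.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the INT8 codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [1.0, -0.42, 0.25, 0.003]
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    for w, c, r in zip(weights, codes, restored):
        print(f"{w:+.4f} -> {c:+4d} -> {r:+.4f}")
```

Each value is stored in one byte instead of four, but anything smaller than about half the scale step rounds to zero, which is the kind of loss that matters more in training than in inference.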

The new generation adds processor support for a 16-bit floating point format called bfloat16 (BF16). Cutting FP32 models’ bit-width in half accelerates processing itself but, more importantly, halves the RAM needed to keep models in memory. Taking advantage of the new data type is also simpler for programmers and codebases using FP32 models than conversion to integer would be.
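Part of the reason the conversion is simple is that bfloat16 keeps FP32's sign bit and 8-bit exponent and shortens only the mantissa, so a value can be converted by keeping the top 16 bits of its FP32 encoding. The sketch below assumes plain truncation (hardware typically rounds instead), and the helper names are hypothetical.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x, by truncation.

    Assumption: we simply drop the low 16 mantissa bits of the FP32
    encoding; real hardware usually rounds to nearest instead.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_bits_to_fp32(b: int) -> float:
    """Expand a bfloat16 bit pattern back to a float by zero-padding."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


def model_bytes(n_params: int, bytes_per_param: int) -> int:
    """Memory needed just to hold the parameters."""
    return n_params * bytes_per_param


if __name__ == "__main__":
    value = 3.14159
    bf16 = fp32_to_bf16_bits(value)
    print(f"{value} -> 0x{bf16:04X} -> {bf16_bits_to_fp32(bf16)}")

    n = 1_000_000_000  # hypothetical one-billion-parameter model
    print(model_bytes(n, 4) / 1e9, "GB as FP32")
    print(model_bytes(n, 2) / 1e9, "GB as BF16")
```

The round trip loses digits of precision but not range, which is why FP32 code can often adopt the type without the rescaling that integer conversion requires.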

Intel also thoughtfully provided a game revolving around the BF16 data type’s efficiency. We cannot recommend it either as a game or as an educational tool.

Optane storage acceleration

Intel also announced a new, 25 percent-faster generation of its Optane “persistent memory” SSDs, which can be used to greatly accelerate AI and other storage pipelines. Optane SSDs use 3D XPoint technology rather than the NAND flash found in typical SSDs. 3D XPoint has tremendously higher write endurance and lower latency than NAND does. The lower latency and greater write endurance make it particularly attractive as a fast caching technology, which can accelerate even all-solid-state arrays.

The big takeaway here is that Optane’s extremely low latency allows acceleration of AI pipelines—which frequently bottleneck on storage—by offering very rapid access to models too large to keep entirely in RAM. For pipelines which involve rapid, heavy writes, an Optane cache layer can also significantly increase the life expectancy of the NAND primary storage beneath it, by reducing the total number of writes which must actually be committed to it.
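The write-reduction effect can be illustrated with a toy write-back cache in Python. This is a sketch under simplifying assumptions (a small set of frequently rewritten blocks, least-recently-used eviction), not a model of any real Optane caching product; the class and function names are hypothetical.

```python
from collections import OrderedDict


class WriteBackCache:
    """Toy write-back cache: repeated writes to a cached block are absorbed
    and reach the backing store only on eviction or flush."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.dirty = OrderedDict()  # block id -> latest data, in LRU order
        self.backing_writes = 0     # writes that actually hit the NAND tier

    def write(self, block, data):
        if block in self.dirty:
            self.dirty.move_to_end(block)       # overwrite in cache only
        elif len(self.dirty) >= self.capacity:
            self.dirty.popitem(last=False)      # evict least recently used
            self.backing_writes += 1
        self.dirty[block] = data

    def flush(self):
        self.backing_writes += len(self.dirty)
        self.dirty.clear()


def simulate(n_writes=1000, hot_blocks=8, capacity=16):
    """Rewrite a few hot blocks many times and count backing-store writes."""
    cache = WriteBackCache(capacity)
    for i in range(n_writes):
        cache.write(i % hot_blocks, i)
    cache.flush()
    return n_writes, cache.backing_writes


if __name__ == "__main__":
    issued, committed = simulate()
    print(f"{issued} writes issued, {committed} committed to backing store")
```

When the hot set fits in the cache, only the final version of each block is ever committed, so the backing store's wear scales with the number of distinct blocks rather than the number of writes.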

Latency vs. IOPS, with a 70/30 read/write workload. The orange and green lines are data center-grade traditional NAND SSDs; the blue line is Optane.

For example, a 256GB Optane has a 360PB write-endurance spec, whereas a Samsung 850 Pro 256GB SSD is only specced for 150TB endurance—greater than a 1,000:1 advantage to Optane.
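The ratio is easy to check with quick arithmetic. The snippet below assumes decimal units (1TB = 10^12 bytes, 1PB = 10^15 bytes) and uses only the figures quoted above; the function names are hypothetical.

```python
GB = 10**9
TB = 10**12
PB = 10**15


def endurance_ratio(a_bytes: int, b_bytes: int) -> float:
    """How many times more data drive A can absorb than drive B."""
    return a_bytes / b_bytes


def full_drive_writes(endurance_bytes: int, capacity_bytes: int) -> float:
    """Number of times the whole drive can be overwritten within spec."""
    return endurance_bytes / capacity_bytes


if __name__ == "__main__":
    optane = 360 * PB   # quoted endurance for the 256GB Optane
    nand = 150 * TB     # quoted endurance for the 256GB Samsung 850 Pro
    print(f"ratio: {endurance_ratio(optane, nand):,.0f}:1")
    print(f"Optane full-drive writes: {full_drive_writes(optane, 256 * GB):,.0f}")
    print(f"NAND full-drive writes:   {full_drive_writes(nand, 256 * GB):,.0f}")
```

On those numbers the advantage works out to 2,400:1, comfortably above the 1,000:1 figure.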

Meanwhile, this excellent Tom’s Hardware review from 2019 demonstrates just how far in the dust Optane leaves traditional data center-grade SSDs in terms of latency.

Stratix 10 NX FPGAs

Finally, Intel announced a new version of its Stratix FPGA. Field-programmable gate arrays (FPGAs) can be used as hardware acceleration for some workloads, freeing more of the general-purpose CPU cores to tackle tasks that the FPGAs can’t.

https://arstechnica.com/?p=1684956