NVIDIA’s Blackwell Ultra has arrived, and it’s rewriting the rules of AI inference performance. The chipmaker’s latest GPU architecture, unveiled in 2025 and now shipping in volume, has dominated the latest MLPerf Inference benchmarks, cementing NVIDIA’s position at the heart of the emerging AI factory era.
The Blackwell Ultra, deployed in the GB300 NVL72 rack-scale system, posted record scores across multiple inference workloads, particularly in the newly introduced reasoning benchmark category, extending NVIDIA's long run at the top of both the training and inference leaderboards.
“Blackwell represents the most significant architectural leap in NVIDIA’s history,” said Jensen Huang, CEO of NVIDIA, during the recent GTC conference. “We’ve designed every transistor with inference efficiency in mind. The results speak for themselves.”
The key to Blackwell's inference performance lies in its second-generation Transformer Engine, which adds hardware support for 4-bit and 6-bit microscaling formats (MXFP4 and MXFP6) alongside NVIDIA's own NVFP4. This 4-bit floating-point capability allows Blackwell to deliver 20 petaflops of FP4 compute while sharply reducing memory footprint and bandwidth requirements.
According to NVIDIA's technical blog, the NVFP4 format uses two-level scaling: an FP8 (E4M3) scale applied to each 16-value micro-block, plus a tensor-level FP32 scale. Together, the two scales capture most of FP4's efficiency gains with minimal accuracy loss.
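To make the two-level scheme concrete, here is a small NumPy sketch of NVFP4-style fake quantization. It is an illustration under stated assumptions, not NVIDIA's implementation: real hardware packs 4-bit E2M1 codes and rounds block scales to FP8 (E4M3), while this version keeps everything in float so the arithmetic is easy to follow.

```python
import numpy as np

# Simplified sketch of NVFP4-style two-level scaling, for intuition only.
# Real hardware stores packed 4-bit E2M1 codes and FP8 (E4M3) block scales.

FP4_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 grid
BLOCK = 16  # NVFP4 applies one block scale per 16 values

def fake_quantize_nvfp4(x: np.ndarray) -> np.ndarray:
    """Quantize to the FP4 grid with per-block + per-tensor scales, then dequantize."""
    amax = np.abs(x).max()
    tensor_scale = amax / 6.0 if amax > 0 else 1.0  # FP32 tensor-level scale
    blocks = x.reshape(-1, BLOCK)
    # Per-block scale so each block's max lands on the largest FP4 value (6.0).
    block_scale = np.abs(blocks).max(axis=1, keepdims=True) / (6.0 * tensor_scale)
    block_scale = np.where(block_scale == 0, 1.0, block_scale)
    scaled = blocks / (block_scale * tensor_scale)
    # Round each scaled value to the nearest representable E2M1 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_MAGNITUDES).argmin(axis=-1)
    q = np.sign(scaled) * FP4_MAGNITUDES[idx]
    return (q * block_scale * tensor_scale).reshape(x.shape)

x = np.random.randn(4, 32).astype(np.float32)
print("mean abs error:", np.abs(x - fake_quantize_nvfp4(x)).mean())
```

The per-block scale absorbs local variation so the coarse 8-value grid is spent on each block's actual range, which is why accuracy holds up far better than a single tensor-wide scale would allow.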
The MLPerf results showed Blackwell Ultra outperforming the previous-generation Hopper architecture by up to 3x on inference workloads, with particularly strong gains in large language model serving and reasoning tasks. This performance leap arrives at a critical moment, as AI inference demand begins to dwarf training demand.
IEEE Spectrum reported that NVIDIA topped MLPerf’s new reasoning benchmark with its Blackwell Ultra GPU, demonstrating the architecture’s suitability for the next generation of reasoning models that require significantly more compute per token than traditional language models.
The implications extend beyond raw performance. Lower inference costs enable use cases that were previously uneconomical. If serving costs settle near $0.10 per million tokens, a floor that Blackwell-class hardware makes plausible, companies can deploy AI at scale without the massive infrastructure investments that characterized the training era.
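For a sense of where a figure like that comes from, a back-of-envelope calculation helps. The dollar-per-GPU-hour and throughput numbers below are placeholder assumptions, not NVIDIA or MLPerf data; the point is only how hourly GPU cost and sustained token throughput combine into a per-token price.

```python
# Back-of-envelope inference economics. Both inputs are assumptions chosen
# for illustration; substitute real cloud pricing and measured throughput.
GPU_HOUR_COST = 4.00        # assumed $/GPU-hour for a Blackwell-class GPU
TOKENS_PER_SECOND = 11_000  # assumed sustained decode throughput per GPU

tokens_per_hour = TOKENS_PER_SECOND * 3600
cost_per_million = GPU_HOUR_COST / tokens_per_hour * 1_000_000
print(f"~${cost_per_million:.2f} per million tokens")  # ~$0.10 with these inputs
```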
Industry analysts note that Blackwell’s inference dominance could accelerate the shift from training-focused to inference-focused capital expenditure. “We’re entering the inference era, and NVIDIA has ensured they’ll remain the platform of choice,” said one prominent technology analyst.
The Blackwell architecture also introduces fifth-generation NVLink, which links up to 72 GPUs in a single NVL72 domain at 1.8 TB/s of bandwidth per GPU, letting them serve a model as one tightly coupled unit. This is crucial for the largest models, where memory capacity and bandwidth often become bottlenecks.
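A quick sizing exercise shows why a rack-scale NVLink domain matters. The parameter count below is an assumed frontier-scale figure, not any specific model; the 288 GB of HBM3e per Blackwell Ultra GPU follows NVIDIA's published spec.

```python
# Why rack-scale NVLink matters: a rough capacity sketch.
# Parameter count is an assumed frontier-scale figure, not a specific model.
PARAMS = 1.8e12        # assumed parameters
BITS_PER_WEIGHT = 4    # FP4 weights
NUM_GPUS = 72          # one GB300 NVL72 NVLink domain
HBM_PER_GPU_GB = 288   # HBM3e per Blackwell Ultra GPU (NVIDIA spec)

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
print(f"total FP4 weights: {weights_gb:.0f} GB")            # ~900 GB
print(f"fits on one GPU?   {weights_gb <= HBM_PER_GPU_GB}")  # False
print(f"per-GPU shard:     {weights_gb / NUM_GPUS:.1f} GB")  # ~12.5 GB
# The model must be sharded across the domain, so every generated token's
# activations traverse NVLink during tensor-parallel collectives, and the
# KV cache consumes much of the remaining HBM.
```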
Competitors AMD and Intel have made progress in inference efficiency, but NVIDIA’s ecosystem advantage—CUDA, TensorRT-LLM, and the broader software stack—remains difficult to replicate. The MLPerf results reflect not just hardware capability but the maturity of NVIDIA’s inference optimization pipeline.
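As a flavor of that software stack, serving a model through TensorRT-LLM's high-level LLM API can look roughly like the sketch below. The shape follows NVIDIA's published examples, but exact argument names and output structure vary by release, and the checkpoint name is only a placeholder.

```python
# Rough sketch of TensorRT-LLM's high-level LLM API; treat as illustrative,
# since signatures differ across releases.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder checkpoint
params = SamplingParams(temperature=0.7, max_tokens=128)

for result in llm.generate(["Why does FP4 speed up inference?"], params):
    print(result.outputs[0].text)
```

Behind that short script sit the engine builder, quantization toolchain, and kernel libraries that take years to mature, which is the ecosystem moat the MLPerf results reflect.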
Looking ahead, NVIDIA has already hinted at next-generation improvements. The company’s roadmap suggests annual architecture updates, with each generation targeting specific efficiency gains for inference workloads. The AI factory era has arrived, and NVIDIA is building the machinery.