NVMe Over Fabrics Technology Enables New Levels Of Storage Efficiency In Today’s Data Centers

In today’s connected world, data is being generated at enormous volume, variety and speed, which increases the burden on storage in both capacity and performance. The rise of cloud computing, big data, social media and IoT makes the storage problem even worse. The costs of acquisition (capital expenses) and of data management (operating expenses) are skyrocketing for cloud service providers. Available computational power is not enough to make sense of the relevant data, which can get lost within the rapidly growing sea of information. Big data analytics can be a nightmare for cloud service providers attempting to quickly monetize data. Even when information is relevant and has economic value, it loses that value over time, so it is imperative to extract value from relevant data in near real time.

Data center operators and enterprises are struggling to meet their immense storage needs while attempting to monetize stored data by applying big data analytics technologies. Older storage architectures face challenges in performance, power consumption, management and data monetization as systems scale to zettabytes and beyond.

The older architecture uses the most powerful processors (2 x86s), a switch, and IO cards (supporting a variety of protocols such as FC, FCoE, InfiniBand and Ethernet at various speeds) to connect to the data center fabric. The expanders connect to SAS (Serial Attached SCSI)/SATA (Serial ATA) hard disk drives (HDDs) and solid state drives (SSDs). All storage services run in software on the processors.

Even when SSDs replace HDDs, the older architectures are still limited to ~50K IOPS (input/output operations per second), whereas Non-Volatile Memory Express (NVMe) SSDs deliver in excess of 1M IOPS. NVMe (www.nvmexpress.org) is a new host interface designed for memory-based storage attached over PCIe. With ~1M+ IOPS at less than one-third the latency of traditional SSDs, and because the interface is standards-based, NVMe SSDs are making huge waves in the market.
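To put the IOPS figures above in perspective, a back-of-the-envelope sketch can convert them into throughput. The 4 KiB I/O size is an assumption chosen for illustration (the text does not state a block size), and the function name is hypothetical.

```python
# Back-of-the-envelope conversion from IOPS to throughput.
# ASSUMPTION: 4 KiB I/O size; the article does not state a block size.
BLOCK_SIZE_BYTES = 4 * 1024


def throughput_mib_per_s(iops: int, block_size: int = BLOCK_SIZE_BYTES) -> float:
    """Return sustained throughput in MiB/s for a given IOPS rate."""
    return iops * block_size / (1024 * 1024)


legacy_iops = 50_000     # ~50K IOPS, older architecture (figure from the text)
nvme_iops = 1_000_000    # ~1M IOPS, NVMe SSD (figure from the text)

legacy_mib = throughput_mib_per_s(legacy_iops)
nvme_mib = throughput_mib_per_s(nvme_iops)

print(f"legacy: {legacy_mib:.1f} MiB/s")
print(f"NVMe:   {nvme_mib:.1f} MiB/s")
print(f"ratio:  {nvme_iops / legacy_iops:.0f}x")
```

Under that assumed block size, the 20x gap in IOPS translates directly into a 20x gap in sustained throughput; a different I/O size changes the absolute numbers but not the ratio.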

The following diagrams show the performance advantages of NVMe SSDs compared with traditional SAS and SATA-based SSDs. This case study was performed by SNIA (Storage Networking Industry Association).


Figure 1: Performance comparison of NVMe with SATA/SAS


Programmable logic is, indeed, a key component in reducing data center power consumption and accelerating computation. FPGAs can be used as hardware accelerators and can be reconfigured, as in the shell and role model, thus significantly increasing their value in the data center.  The Xilinx SDAccel™ Development Environment for data center workload acceleration can be used to reconfigure FPGAs to be purpose-built while supporting different applications on the same hardware.

The new NVMe over Fabrics architecture, shown in Figure 2, is scalable and optimized for a 3x-5x increase in performance and up to 100x lower latency, with services implemented in FPGA-based hardware acceleration. For file compression, implementing these services in Xilinx’s MPSoC (Multiprocessor System on Chip) has resulted in up to 100x lower latency compared to a standard x86 processor.

New storage architectures are evolving around scale-out storage, also known as fabric-attached storage. Storage is distributed across multiple servers with NVMe-based all-flash storage devices, all connected via a fabric. This scalable architecture lets multiple data centers act as a single storage domain, so storage can scale across the globe. Users of this architecture can scale the Network Attached Storage (NAS) heads (performance) independently, and additional storage can be attached without a forklift upgrade of the NAS heads.

The NVMe over Fabrics architecture supports NFS (Network File System), CIFS (Common Internet File System) or block storage over iWARP (Internet Wide Area RDMA Protocol) or RoCEv2 (RDMA over Converged Ethernet, version 2) to transfer data from application servers to storage servers. The storage servers are distributed in a scale-out architecture, and the storage devices are connected to them via the fabric. The fabric technology is implementation-dependent and could be PCIe, Ethernet, Converged Ethernet, Fibre Channel or InfiniBand. Storage services can run on the storage devices (also known as target devices), where they are accelerated in hardware. The services are configurable to suit the end user, and the capacity of each service is use-case dependent.


Figure 2: NVMe over Fabrics concept


Xilinx provides its SDAccel Development Environment, which accepts C, C++ or OpenCL source as input. The toolset converts this source to RTL (a Register Transfer Level description in Verilog or VHDL), and Xilinx’s Vivado® Design Suite converts the RTL to a bitstream that is downloaded into the FPGA. The bitstream configures the logic functions in the FPGA.

Figure 3 shows the SDAccel and Vivado toolset with its partial reconfiguration flow along with the shell and role model for configuring/reconfiguring storage services in Xilinx FPGAs.


Figure 3: SDAccel with Vivado and partial reconfiguration with storage services


The purpose of the Xilinx partial reconfiguration flow is to implement and reconfigure a portion of the FPGA on the fly while the rest of the FPGA keeps running other functions. Using this flow, Xilinx supports the industry-wide shell and role model for reconfigurability, in which the shell includes connectivity such as PCIe, NVMe controllers, a DDR memory controller and an NVMe over Fabrics module. The shell is always on, whereas the role is design-dependent. In this case, we implemented compression/decompression as an example role. The role has standard AXI interfaces so that IP from different sources can be used. This type of hardware acceleration has shown performance benefits in excess of 100x compared to storage services implemented in software on processors. The algorithm implemented in hardware was LZ77 plus Huffman coding, and it was compared with GZIP running on an x86 processor.
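The division of labor between shell and role can be illustrated with a small software analogy. This is only a sketch of the concept, not a Xilinx API: the `Shell` class, its methods and the payload are hypothetical, and Python's zlib (a DEFLATE implementation, i.e. LZ77 plus Huffman coding) stands in for the hardware compression role.

```python
import zlib
from typing import Callable, Optional

Role = Callable[[bytes], bytes]


class Shell:
    """Software analogy of the always-on shell (hypothetical, not a Xilinx API)."""

    # Fixed connectivity named in the text; it stays up across role swaps.
    INTERFACES = ("PCIe", "NVMe controller", "DDR memory controller",
                  "NVMe over Fabrics")

    def __init__(self) -> None:
        self.role_name: Optional[str] = None
        self._role: Optional[Role] = None
        self.requests_served = 0  # shell state that survives reconfiguration

    def load_role(self, name: str, role: Role) -> None:
        """Swap only the role; the shell's own state is left untouched."""
        self.role_name = name
        self._role = role

    def process(self, data: bytes) -> bytes:
        """Pass data through whichever role is currently loaded."""
        if self._role is None:
            raise RuntimeError("no role loaded")
        self.requests_served += 1
        return self._role(data)


shell = Shell()
payload = b"NVMe over Fabrics " * 64

shell.load_role("compress", zlib.compress)
packed = shell.process(payload)

shell.load_role("decompress", zlib.decompress)  # reconfigure only the role
restored = shell.process(packed)

print(shell.role_name, shell.requests_served, restored == payload)
```

The point of the analogy is that the request counter and the interface list belong to the shell and persist while the role is replaced, just as the connectivity logic keeps running while the role region of the FPGA is reconfigured.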


The diagram below shows latency for hardware versus software. The hardware implementation is approximately 100x faster than the software implementation. The algorithms are implemented in C/C++/OpenCL using the SDAccel tool flow with partial reconfiguration. The red line is the processor; the blue line is the hardware-based accelerator, whose latency stays fairly constant across file sizes.


Figure 4: Performance comparison of x86 with FPGA
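The software side of a comparison like this could be measured roughly as follows. This is a minimal sketch, not the benchmark used for Figure 4: it relies on Python's standard gzip module (DEFLATE, i.e. LZ77 plus Huffman coding), and the payload contents, file sizes and function names are assumptions made for illustration.

```python
import gzip
import time


def make_payload(size: int) -> bytes:
    """Build a deterministic, compressible buffer of exactly `size` bytes."""
    pattern = b"scale-out storage with NVMe over Fabrics; "
    return (pattern * (size // len(pattern) + 1))[:size]


def time_gzip(size: int, repeats: int = 3) -> float:
    """Return the best-of-`repeats` wall-clock time (seconds) to gzip `size` bytes."""
    data = make_payload(size)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        compressed = gzip.compress(data)
        best = min(best, time.perf_counter() - start)
    assert gzip.decompress(compressed) == data  # sanity check: lossless round trip
    return best


# ASSUMPTION: hypothetical file sizes; the article does not list the sizes tested.
sizes = [64 * 1024, 1024 * 1024, 4 * 1024 * 1024]
results = {size: time_gzip(size) for size in sizes}

for size, seconds in results.items():
    print(f"{size // 1024:>5} KiB: {seconds * 1e3:.3f} ms")
```

Timings depend on the machine and on how compressible the data is, so the numbers printed here are not comparable to the figure; the sketch only shows how a per-file-size software latency curve could be collected.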


As shown earlier, compression/decompression was the first service implemented in Xilinx’s solution.

With Xilinx’s SDAccel flow, coupled with partial reconfiguration and connectivity interfaces such as PCIe, memory controllers, Ethernet MACs and NVMe controllers, you can use the shell and role reconfigurability model with Xilinx FPGAs to implement hardware accelerators in the NVMe over Fabrics architecture. This fabric-attached storage, with its scalable performance and highly efficient platform for analytics, provides a much lower total cost of ownership for cloud service providers.

Future work will explore porting other storage services, such as security, matrix multiplication, acceleration of Spark machine-learning (ML) libraries for analytics, and de-duplication, to hardware accelerators in scale-out storage architectures.


About the Author

Shreyas Shah has over 20 years of experience in designing chips and systems. He started his career at Alantec/FORE/Marconi Communications in the mid-1990s as a networking chip designer. He later moved to various startups (Sisilk Networks, Fabric7 Systems and Xsigo Systems), where he held CTO and architect positions. Shreyas has been with Xilinx for more than six years; he joined as a wired architect and currently holds the position of Principal Data Center Architect.