HPE, Simplivity, FPGAs and VM Contagion

HPE's announcement that they have agreed to acquire Simplivity has filled a fair few column inches today.

On paper it looks like a good move to me, but what I've found most interesting when reading articles and commentary about the acquisition is the below-the-line comments criticising one of Simplivity's key architectural decisions: the offloading of compression and dedupe to FPGAs.

Which brings me back to a point I raised in my post on the pitfalls of hyperconverged storage back in October: specifically, the issue of VM Contagion (whereby CPU contention caused by running VMs can negatively affect the performance of a hyperconverged storage system).

Whilst talking about this issue, I remarked that Simplivity's FPGA approach looked to provide an interesting solution. (I stopped short of praising it beyond this point on the basis that I've never used Simplivity myself.)

Which dovetails quite nicely into a follow-up post that I had half-written back in November. (Before I became swamped by dirty nappies.)

Writing my original post got me thinking in more detail about the issue of VM Contagion. Whilst I was sure this was a genuine issue that I'd seen with my own eyes in the past, the only evidence I had was anecdotal.

I wasn't really happy with anecdotal evidence, so I decided I would do some synthetic testing in the lab to see if I could recreate the effect and then measure the extent to which it might be a factor.

The test bed was a 4-node vSphere cluster, with each host equipped with dual Intel E5-v3 CPUs, 128GB RAM, and two enterprise SATA SSDs. On top of this, I installed a hyperconverged storage product, with inline compression enabled and dedupe disabled. (In fairness to the vendor, I'm choosing not to name the hyperconverged storage product I used in this test as I don't think this issue is unique to their product.)

Aside from the VMs I provisioned for testing and the storage controller VMs, the cluster was empty.

So my first task was to see if I could recreate the effect under controlled conditions.

I started by taking a baseline of the disk performance. I built a VM running Windows Server 2012 R2 on Host 1 (DiskTest-VM1) and ran a disk IO test using Microsoft's DiskSpd utility.

I arbitrarily chose test parameters that I felt would work the storage without pushing it to its limit. (Random IO, 70% writes, 8K blocks.)
The end result was a respectable 26,638 IOPS with 2.4ms average latency.
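
For anyone wanting to try something similar, here is a rough sketch of how a DiskSpd run with those core parameters could be put together. Only the random IO, 70% writes and 8K block size come from the test described above; the duration, thread count, queue depth and target file path are assumptions for illustration, not the values I used.

```python
def diskspd_args(target, block="8K", write_pct=70,
                 duration_s=60, threads=4, outstanding=8):
    """Build a DiskSpd argument list for a random, mixed read/write test.

    duration_s, threads and outstanding are illustrative assumptions;
    the post only specifies random IO, 70% writes and 8K blocks.
    """
    return [
        "diskspd.exe",
        f"-b{block}",        # block size
        "-r",                # random IO
        f"-w{write_pct}",    # percentage of writes
        f"-d{duration_s}",   # test duration in seconds (assumed)
        f"-t{threads}",      # threads per target (assumed)
        f"-o{outstanding}",  # outstanding IOs per thread (assumed)
        "-Sh",               # bypass software and hardware write caching
        "-L",                # collect latency statistics
        target,              # pre-created test file (hypothetical path)
    ]


if __name__ == "__main__":
    # Print the command line rather than running it.
    print(" ".join(diskspd_args(r"D:\diskspd-test.dat")))
```

Printing the command first makes it easy to sanity check the parameters before pointing it at real storage.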

I then built a second Server 2012 R2 VM, this time on Host 2 (CPUTest-VM2). I gave this VM the same number of logical cores as there were in the host, and used multiple instances of CPUStres to keep the VM's CPU usage at 50%. (The idea being to crudely simulate 50% CPU usage on Host 2.)

I reran the original test on DiskTest-VM1, and this time the same test returned 23,960 IOPS, with 2.67ms average latency - roughly 10% fewer IOPS and 11% longer latency. Perhaps not a conclusive result, but certainly worthy of further investigation.

So I dialled up the CPU usage on CPUTest-VM2 to 75% and reran the disk IO test on DiskTest-VM1.
This time the difference was more significant: IOPS were down to 18,259, and latency was up to 3.5ms - roughly 31% fewer IOPS and 46% longer latency compared to the baseline.

Confident that I had been able to recreate the effect, I then stopped all the CPUStres instances on CPUTest-VM2 and ran the test on DiskTest-VM1 again. The result was, as expected, nearly identical to the baseline test - 26,464 IOPS with 2.41ms latency.
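
The percentage changes relative to the baseline can be recomputed from the raw figures quoted above. This is just a small sketch of that arithmetic; the run labels and function names are mine.

```python
# Results quoted in the post: (IOPS, average latency in ms).
BASELINE = (26638, 2.40)
RUNS = {
    "Host 2 at 50% CPU": (23960, 2.67),
    "Host 2 at 75% CPU": (18259, 3.50),
    "CPU load removed": (26464, 2.41),
}


def pct_change(new, old):
    """Percentage change of new relative to old (negative means a drop)."""
    return (new - old) / old * 100.0


def summarise(baseline, runs):
    """Return {run name: (IOPS change %, latency change %)} against baseline."""
    base_iops, base_lat = baseline
    return {
        name: (pct_change(iops, base_iops), pct_change(lat, base_lat))
        for name, (iops, lat) in runs.items()
    }


if __name__ == "__main__":
    for name, (d_iops, d_lat) in summarise(BASELINE, RUNS).items():
        print(f"{name}: IOPS {d_iops:+.1f}%, latency {d_lat:+.1f}%")
```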

This had really piqued my interest. If 75% CPU load on a single host could have this dramatic an impact, what CPU load would I have to apply across the cluster before I noticed a reduction in performance?

So, I extended my original test to allow me to apply a uniform CPU load across the whole cluster. I built three additional VMs (CPUTest-VM1, CPUTest-VM3, CPUTest-VM4). CPUTest-VM3 and CPUTest-VM4 were configured as per CPUTest-VM2, and placed on Hosts 3 and 4. CPUTest-VM1, however, was built with half the number of logical cores, since it would be running on Host 1 with DiskTest-VM1. (Ideally, I would have had a compute-only host from which I could run DiskTest-VM1, but that wasn't possible at that time.)

Back on DiskTest-VM1, I slightly reduced the aggressiveness of the Disk IO test, whilst retaining the core parameters from before. (Random IO, 70% writes, 8K blocks.)
I then ran the test a number of times, increasing the CPU load on every host in the cluster in 12.5% increments, starting at 0% and going all the way up to 100%. (Except for Host 1, where the increments were 6.25% and the limit was 50%.)
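
To make the stepping concrete, the schedule of target CPU loads per host can be sketched as below. The function name and host labels are only for illustration; the step size, the 100% limit and the halved load on Host 1 come from the description above.

```python
def load_schedule(step=12.5, max_load=100.0, host1_scale=0.5):
    """Return one dict of target CPU load (%) per host for each test run.

    Hosts 2-4 step from 0% to max_load in `step` increments. Host 1 runs
    at host1_scale of that load because it also hosts DiskTest-VM1.
    """
    steps = int(round(max_load / step))
    schedule = []
    for i in range(steps + 1):
        load = i * step
        schedule.append({
            "Host 1": load * host1_scale,
            "Host 2": load,
            "Host 3": load,
            "Host 4": load,
        })
    return schedule


if __name__ == "__main__":
    for run, loads in enumerate(load_schedule(), start=1):
        print(run, loads)
```

With the defaults this gives nine runs, from an idle cluster up to 100% on Hosts 2 to 4 and 50% on Host 1.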

I recorded the IOPS and average latency at the end of each test, from which I was able to create the following graph.

Granted, this was a fairly crude test knocked up by me in an afternoon, but I think the results show it's definitely something that you need to consider when looking at hyperconverged storage.

As always with these things, your mileage may vary. Inline dedupe, I'm sure, would magnify this problem further. (If I get time I will run the same test with dedupe enabled to see to what extent that is true.)

Which brings us, in a roundabout way, back to Simplivity and FPGAs. In my view, FPGAs are especially relevant with respect to the future evolution of hyperconverged infrastructure, so there is clear merit to this approach architecturally speaking. I have no idea how well Simplivity's product works today, but what I do know is that HPE have the resources and the expertise to get it where it needs to be.