Maxta - A Cautionary Tale of Hyperconverged Storage Gone Wrong

In my recent post, Pitfalls of Hyperconverged Storage, I mentioned that I had enjoyed some great successes with hyperconvergence, but also some spectacular failures.

My experience with Maxta was without doubt a spectacular failure, and their recent announcement that they are to adopt a freemium model got me thinking once again about the catastrophic mess we found ourselves in with Maxta.

In March 2015 I was looking at potential hyperconverged storage vendors for an upcoming VMware cloud platform refresh. We had just run much the same project for a Hyper-V cloud refresh using ScaleIO, and buoyed by our success, but not keen to accept EMC's suggestion that we pay double the price we had paid 12 months earlier, we decided to look again at the market.

One of the companies I reached out to was Maxta. On paper they had an interesting proposition - a Nimble-esque hybrid flash/HDD solution that utilized inline compression and deduplication. Within a few days of engaging them, I was given remote access to their POC lab environment, alongside an extremely aggressively priced quote.

Initial impressions were mostly positive. As you'd expect, read performance was excellent, and whilst the write performance was unspectacular, it was good enough to want to see Maxta running on our own kit. The only real issue we picked up at this point was that write performance was a little unpredictable at times, something Maxta assured us they were already aware of and had already fixed in an upcoming new release.

There would be no direct upgrade path to this new release though, and as we had a write-heavy workload (approx. 70% writes), we put Maxta to one side whilst we looked at other vendors.

Fast-forward to August 2015 and we were given a pre-release copy of the delayed, but now imminently available, new version to install and test on our hardware, the expectation being that by the time we had tested it thoroughly, the release would have reached general availability.

After a few weeks of testing, we uncovered some issues with the new Maxta release, the climax of which was the catastrophic failure of the Maxta cluster and the loss of all the data. Far from ideal, but we accepted that this was a pre-release version, and since the data was all test data, we had lost nothing.

Maxta's tech team were keen to try and understand what had gone wrong, so we gave them all the relevant log files, and they went away and spent a good couple of weeks analysing them whilst we looked elsewhere.

Maxta later came back to us. From log analysis they had identified a number of serious issues with the new release that would take some time to fix. They were keen for us to see the product working reliably though, so they recommended that we re-test, on our own hardware, the version we had tested back in March. Their assertion was that the old version was "Rock-Solid" and that the beefier specification of our nodes, compared to those in the Maxta POC lab, would smooth out the write performance unpredictability.

In retrospect, we should have walked away at this point, but blinded by the commercial attractiveness of the deal, we decided it couldn't do any harm to have another look.

So we tested the old version on our hardware, and it was very impressive. Read performance was blistering and write performance was much improved. Crucially, during testing we were not able to recreate the unpredictability when it came to write performance, even under heavy load.

In fact, the only issue we had with the product at this point was that deduplication wasn't actually supported, despite being listed as a key product feature on Maxta's website.

But we could live without dedupe for now, so in November 2015, when Maxta told us they were about to change their licensing model from per TB to per Node/Socket (which would have made things somewhat more expensive in our case), it was decided to go ahead and buy it while it was comparatively cheap.

Satisfied with everything we had seen to date with the POC, and with it already installed on the hardware that would be the start of our new cluster, we started storage vMotioning VMs over to Maxta.

Everything was looking good. We had 200 VMs running happily on the platform (about a fifth of what we would need initially), but then, as another storage vMotion was in progress, the Maxta datastore went into read-only mode without warning. All the VMs went down, and Maxta's US-based 24/7 support could not be reached by any means.

After two hours of all the VMs being down, and with us still unable to get in contact with anyone at Maxta, we decided we would try rebooting all the nodes in the cluster.

Rebooting brought the Maxta datastore back online, but whilst Maxta's technical team spent the next couple of weeks trying to understand what had happened, we endured a storage platform that would periodically go read-only. (Ideally, we would have migrated the VMs off after the first incident, but it was felt that the increased load of storage vMotions on the already fragile storage system could make the issue more frequent.)

Eventually, it was identified that the LSI storage controller in one of the cluster nodes was periodically resetting due to a bad firmware version. Maxta is designed to withstand node failure (and during testing had been able to do so), but under a load of 200 active VMs, it was unable to flush the write cache to disk quickly enough in a node failure scenario, causing the filesystem to become read-only.
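To illustrate why the same node failure could be survivable in testing but not under production load, here is a deliberately simplified toy model. It is not a description of Maxta's internals: the function name, the cache size, and the throughput figures are all hypothetical. It only captures the general idea that a write cache can drain only while the flush rate exceeds the rate of incoming writes.

```python
from typing import Optional


def seconds_to_drain(dirty_mb: float,
                     flush_mb_per_s: float,
                     incoming_mb_per_s: float) -> Optional[float]:
    """Toy model: time for a write cache to empty while writes keep arriving.

    All parameters are hypothetical illustrations, not Maxta measurements.
    Returns None when incoming writes match or outpace the flush rate,
    meaning the cache never drains.
    """
    if dirty_mb <= 0:
        return 0.0
    net_drain_rate = flush_mb_per_s - incoming_mb_per_s
    if net_drain_rate <= 0:
        return None
    return dirty_mb / net_drain_rate


# Light test load: the cache empties in a finite time.
light = seconds_to_drain(dirty_mb=4096, flush_mb_per_s=400, incoming_mb_per_s=100)

# Heavy production load: writes arrive faster than they can be flushed.
heavy = seconds_to_drain(dirty_mb=4096, flush_mb_per_s=400, incoming_mb_per_s=450)

print(light, heavy)
```

With the made-up numbers above, the light case drains in under 14 seconds, whereas the heavy case never drains at all, which is the kind of gap between test and production behaviour described here.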

The controller was updated to the necessary firmware version, which fixed the controller reset issue, and as a temporary fix, Maxta set up a cron job to aggressively flush the write cache. (This seemed to have a negative effect on performance but made the platform stable enough to vMotion the VMs off it at least.)

Whilst investigating the issue further, Maxta also noticed that their installer had failed to set up metadata caching correctly, and concluded that the only way to fix this would be a complete re-install.

By this point, Maxta had released their new version, and so strongly recommended that we install it, insisting that all the issues we'd seen some months earlier with the pre-release had been resolved, along with the issues we'd been experiencing with the previous version. (There was still no dedupe support though.)

The new version was production-ready, we were assured, so suspecting that the old version couldn't meet our needs, we reluctantly agreed to give the new version a go.

We installed the new version and did some testing. Performance was significantly down on the previous version, but Maxta told us they had deliberately sacrificed performance to ensure stability. (The performance would come later, they said.)

So we started to trickle some VMs over to the platform. We only got to around 20 VMs this time before things started to go badly wrong. Maxta controller VMs started randomly rebooting, seemingly due to resource exhaustion. This resulted in IO performance and latency becoming erratic in the extreme. At times it would take the entire Maxta filesystem offline; at other times it would be limited to just a subset of VMs. (There were other issues too, albeit less critical.)

For the next month we tolerated the pain of a barely usable storage platform whilst we gave Maxta a chance to fix the critical issues, but they couldn't fix them. As with the previous version, migrating VMs off the platform was difficult due to the instability of the platform, and to add insult to injury, the Maxta filesystem was now massively over-reporting VM disk space utilization to vCenter, which meant that target datastores needed to be 10-20x bigger than the size of the VM in order to storage vMotion off Maxta.
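To put that over-reporting in concrete terms, the arithmetic can be sketched as below. The 10x and 20x factors come from the experience described above; the function names and the 100 GB example VM are purely illustrative assumptions.

```python
from typing import Tuple


def required_target_gb(vm_size_gb: float, over_report_factor: float) -> float:
    """Capacity a target datastore needs if VM usage is over-reported.

    Hypothetical helper: simply scales the real VM size by the factor
    by which disk utilization is being over-reported.
    """
    if vm_size_gb < 0:
        raise ValueError("vm_size_gb must be non-negative")
    if over_report_factor < 1:
        raise ValueError("over_report_factor must be at least 1")
    return float(vm_size_gb) * float(over_report_factor)


def target_range_gb(vm_size_gb: float,
                    low_factor: float = 10.0,
                    high_factor: float = 20.0) -> Tuple[float, float]:
    """Best and worst case target sizes for the 10-20x range seen here."""
    return (required_target_gb(vm_size_gb, low_factor),
            required_target_gb(vm_size_gb, high_factor))


# An illustrative 100 GB VM would need a target datastore of 1-2 TB.
print(target_range_gb(100))
```

In other words, moving even a modest VM required finding terabytes of free space elsewhere, which goes some way to explaining why getting off the platform was so slow.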

So feeling that we had exhausted all our options, in March 2016 we pulled the plug, and asked Maxta for our money back. In our opinion, the product was clearly not fit for purpose, as the myriad of issues we experienced over the previous 6 months had proved. Maxta declined, arguing that all software has bugs, and that they would be fixed in time.

A painful and expensive lesson learnt for us from trying to do things on the cheap.

A disappointing lack of awareness from Maxta regarding a customer's basic expectations of a storage platform.

It seems this post has caused something of a storm. I have written a follow-up post that can be found here.