Since mid-November and my last blog post we have been moving forward in small steps. Both internal and external resources were somewhat limited during this period, and vacations and holidays made things even worse.
But finally, on January 13th, we began moving “real” VMs over to the Nutanix environment. So far we have 32 VMs in the cluster, 15 of them large SQL Server machines between 0.5 and 1.7 TB in size. Most of them have all databases PAGE compressed, the highest level of compression in SQL Server, and we have still reached a compression ratio of 1.8:1 in Nutanix. There is still very limited load on most of these servers, so performance remains to be evaluated.
The cluster now consists of eight 6060 nodes with two containers for VMs. Both containers run RF3; one is compressed and the other uncompressed. The reason for RF3 is that we don’t run any backups on the Nutanix environment yet, and we don’t replicate snapshots to any other cluster. Snapshots have been set up to run daily with 7 days of retention.
Our impression when we decided to go with Nutanix was that node and block awareness with RF2 would be sufficient for our initial demands, since this is a POC environment for test/dev VMs only. What we didn’t know was that block awareness is only “best effort” when running RF2. It came as a big surprise to us when the PRISM GUI started to report that we did not have block awareness after big load tests and while copying large VMs and files into the Nutanix environment.
To feel confident with the environment we reinstalled the cluster with RF3 and created new containers. As long as we had a mix of RF2 and RF3 containers in the storage pool, we continued to get alerts about block awareness not being available. Last week we migrated the last VM from the RF2 container to the RF3 container and removed the RF2 container. Since then we have not had any alerts about block awareness.
I’m not sure whether staying on RF2 would have been a big problem. Nutanix says that most customers run RF2 and that no data has ever been lost for any customer. Losing both nodes in a block is probably very unlikely. What worried us most was that if a VM runs in a block with two nodes and both fail, it could potentially start up on a third node without having access to all of its blocks. A normal Windows VM with a block missing in a Windows DLL file might get a blue screen. But on a SQL Server, a block might be missing inside a database, and that would not be noticed until someone accessed it.
Administrators would of course know if a node or two were lost and could take action to limit access or avoid starting up VMs from the failing nodes, but it is still an interesting scenario to think about.
We will reevaluate the design and will probably run only RF2 in any coming implementations, since we would probably have cross-cluster replication of snapshots and external backup in the future. It would also be possible to use only one node in each block for a given cluster and thereby achieve block awareness even with RF2, but that was not possible in this implementation.
The only drawback with RF3 is that it needs 50% more storage than RF2, since three copies of the data are kept instead of two. We have run a lot of performance tests, both with synthetic test tools and with ETL jobs in SQL Server, and have seen no performance difference between RF2 and RF3. When the load on the servers gets higher, writing to three nodes instead of two will of course cause additional network traffic between nodes. The future will tell whether this has any negative effect.
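The 50% figure follows directly from the replication factors. Here is a minimal Python sketch of the arithmetic, assuming raw consumption is simply the logical data size multiplied by the number of copies. It ignores compression and any metadata overhead, and the 10 TB figure is a made-up example value, not a number from our cluster.

```python
def raw_needed_tb(logical_tb, replication_factor):
    """Raw capacity consumed when every block is stored `replication_factor` times.

    Simplifying assumption: no compression and no metadata overhead.
    """
    return logical_tb * replication_factor


logical_tb = 10.0  # hypothetical amount of VM data, in TB

rf2_tb = raw_needed_tb(logical_tb, 2)
rf3_tb = raw_needed_tb(logical_tb, 3)

# Relative increase when going from two copies to three.
extra_fraction = (rf3_tb - rf2_tb) / rf2_tb

print(f"RF2: {rf2_tb} TB raw, RF3: {rf3_tb} TB raw, extra: {extra_fraction:.0%}")
```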
One of the first servers we cloned to Nutanix for testing was a SQL Server 2012 warehouse server with a home-made, stored-procedure-based ETL load process. This ETL load consists of many single-threaded SQL inserts and updates, so CPU clock speed becomes very important for a fast load. In this case we moved the VM from a three-year-old HP blade with 3.07 GHz CPUs to a 6060 node with only 2.8 GHz CPUs. We immediately saw a 10-15% increase in load time because of this, but we also saw an additional 10-15% increase in load time that we initially couldn’t explain. Nutanix helped out for several days, trying to understand what was causing this performance drop. We started by implementing all best practices for SQL Server on VMware and went on with even more changes recommended by Nutanix, none of them making any major difference.
A Performance Monitor setup where we added the transaction log flushes/sec counter finally made me realize that we were writing 100,000-200,000 small 512-byte blocks to disk per second during execution of some specific stored procedures. I could also see that the time for each write was far higher in Nutanix than in the old NetApp environment.
Finally we could pinpoint a cursor-based insert and update taking place in the stored procedure. Cursor-based inserts and updates are considered bad practice in the SQL community and should be avoided as much as possible. In this case, just adding a BEGIN TRANSACTION at the beginning and a COMMIT TRANSACTION at the end of the stored procedure made the transaction log flushes/sec drop to about 100/sec, with every write being 64 KB in size. By doing so we could eliminate the additional 10-15% increase in load time.
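The mechanism is that, without an explicit transaction, SQL Server commits every single statement on its own, and each commit has to be flushed to the transaction log. The Python sketch below illustrates the same pattern using SQLite as a stand-in, since it runs anywhere without a server. It is not our stored procedure and not T-SQL; the table `fact` and the helper names are made up for the illustration. It only counts commits, which is what drives the number of log flushes.

```python
import sqlite3


def load_rows(conn, rows, single_transaction):
    """Insert rows one at a time, like a cursor loop, and return the commit count."""
    commits = 0
    cur = conn.cursor()
    if single_transaction:
        cur.execute("BEGIN")  # one transaction around the whole loop
    for key, amount in rows:
        if not single_transaction:
            cur.execute("BEGIN")  # mimics a per-statement (autocommit) transaction
        cur.execute("INSERT INTO fact (id, amount) VALUES (?, ?)", (key, amount))
        if not single_transaction:
            cur.execute("COMMIT")
            commits += 1
    if single_transaction:
        cur.execute("COMMIT")
        commits += 1
    return commits


def run(single_transaction, n=1000):
    # isolation_level=None: the sqlite3 module opens no implicit transactions,
    # so the explicit BEGIN/COMMIT statements above are the only ones issued.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE fact (id INTEGER PRIMARY KEY, amount REAL)")
    rows = [(i, i * 1.5) for i in range(n)]
    commits = load_rows(conn, rows, single_transaction)
    row_count = conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
    conn.close()
    return commits, row_count


print("per-row commits:   ", run(single_transaction=False))
print("single transaction:", run(single_transaction=True))
```

Both variants load the same rows; only the number of commits differs, one per row versus one for the whole load.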
But why did it run slower in Nutanix than in NetApp? At first I thought I had found something that would make the load process faster on the original server as well, which was still running on NetApp, but to my surprise the change did not decrease the load time at all in the NetApp environment.
The reason for this is probably that NetApp uses NVRAM to acknowledge writes from clients, while in Nutanix the acknowledgement takes place on SSD disks. When writes occur at as high a frequency as in the cursor-based update, this becomes a bottleneck. And since the cursor in this case updated billions of rows, it became a huge problem.
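A rough way to see why per-write acknowledgement latency matters so much here: in a single-threaded load, the time spent waiting on the log is roughly the number of flushes multiplied by the latency of each flush. The Python sketch below uses made-up numbers only. The flush counts and the latencies in microseconds are assumptions for illustration, not measurements from NetApp or Nutanix.

```python
def total_wait_seconds(flushes, latency_us):
    """Time spent waiting for log flushes if each one is acknowledged serially."""
    return flushes * latency_us / 1_000_000


# Hypothetical workload: one flush per row versus a few large batched flushes.
per_row_flushes = 1_000_000
batched_flushes = 1_000

# Hypothetical acknowledgement latencies, in microseconds.
fast_ack_us = 20
slow_ack_us = 100

for name, latency_us in (("fast ack", fast_ack_us), ("slow ack", slow_ack_us)):
    per_row = total_wait_seconds(per_row_flushes, latency_us)
    batched = total_wait_seconds(batched_flushes, latency_us)
    print(f"{name}: per-row {per_row:.1f} s, batched {batched:.3f} s")
```

With few, large flushes the latency difference nearly disappears from the total, which matches what we saw once the load ran inside a single transaction.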
I have already started on part 4 and part 5 of the Nutanix journey. Stay tuned for more info.
* More performance comparisons of moved VMs
* Migrating VMs to Nutanix might be slow... vMotion has limitations. Could we use a third-party tool like Veeam replication?
* What about backups? Are snapshots enough? What about SQL backups and single-database restores? Integration with NetBackup, and are there alternatives?
* Monitoring of Nutanix: how do we monitor it in Nagios?