Nutanix journey – part 3

Since mid-November and my last blog post we have been moving forward in small steps. Both internal and external resources were limited during this period, and vacations and holidays made things even worse.

But finally, on January 13th, we began moving “real” VMs over to the Nutanix environment. So far we have 32 VMs in the cluster, 15 of them large SQL Servers between 0.5 and 1.7 TB in size. Most of them have all databases PAGE compressed, the highest level of compression in SQL Server, and we have still reached a 1.8:1 compression ratio in Nutanix. The load on most of these servers is still very limited, so performance is yet to be evaluated.

The cluster is now made up of eight 6060 nodes with two containers for VMs. Both run RF3; one is compressed and the other uncompressed. The reason for RF3 is that we don’t run any backups of the Nutanix environment yet, and we don’t replicate snapshots to any other cluster. Snapshots have been set up daily with 7 days of retention.

Our impression when we decided to go with Nutanix was that node and block awareness with RF2 would be sufficient for our initial demands, since this is a POC environment for test/dev VMs only. What we didn’t know was that block awareness is only “best effort” when running RF2. It was a big surprise to us when the PRISM GUI started to report that we did not have block awareness after big load tests and while copying large VMs and files into the Nutanix environment.

To feel confident with the environment we reinstalled the cluster with RF3 and created new containers. As long as we had a mix of RF2 and RF3 containers in the storage pool, we continued to get alerts about block awareness not being available. Last week we migrated the last VM from the RF2 container to the RF3 container and removed the RF2 container. Since then we have had no more alerts about block awareness.

I’m not sure staying on RF2 would have been a big problem. Nutanix says that most customers run RF2 and that no customer has ever lost data. Losing both nodes in a block is probably very unlikely. What worried us most was that if a VM runs in a block with two nodes and both fail, it could potentially start up on a third node without having access to all of its blocks. A normal Windows VM with a block missing in a Windows DLL file might get a blue screen. But on a SQL Server, a block might be missing inside a database, and that would not be noticed until someone accessed it.

Administrators would of course know if a node or two was lost and could take action to limit access, or avoid starting up VMs from the failed nodes, but it is still an interesting scenario to think about.
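To make the scenario concrete, here is a small toy model, not a description of how Nutanix actually places replicas. It assumes four blocks with two nodes each (the node names are made up) and simply enumerates every way of putting two or three copies of a piece of data on distinct nodes, counting the placements that would be wiped out if one whole block failed.

```python
from itertools import combinations

# Assumed layout for illustration: 4 blocks (A-D) with 2 nodes each.
# Node names such as "A1" are hypothetical.
NODE_TO_BLOCK = {f"{block}{i}": block for block in "ABCD" for i in (1, 2)}


def survives_any_single_block_loss(replica_nodes):
    """True if the replicas live in at least two different blocks."""
    return len({NODE_TO_BLOCK[n] for n in replica_nodes}) >= 2


def vulnerable_placements(rf):
    """Placements of `rf` replicas on distinct nodes that one block failure destroys."""
    return [
        placement
        for placement in combinations(sorted(NODE_TO_BLOCK), rf)
        if not survives_any_single_block_loss(placement)
    ]


rf2_bad = vulnerable_placements(2)
rf3_bad = vulnerable_placements(3)
print("RF2 placements lost with one block:", rf2_bad)
print("RF3 placements lost with one block:", rf3_bad)
```

In this toy model the only RF2 placements at risk are the ones where both copies sit in the same block, which is exactly the case a “best effort” policy cannot rule out. With three copies on distinct nodes and only two nodes per block, at least one copy always ends up in another block.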

We will reevaluate the design and will probably run only RF2 in any coming implementations, since we would probably have cross-cluster replication of snapshots and external backup in the future. It would also be possible to use only one node from each block in a given cluster and thereby achieve block awareness even with RF2, but that was not possible in this implementation.

The only drawback with RF3 is that it needs 50% more storage. We have run a lot of performance tests, both with synthetic test tools and with ETL jobs in SQL Servers, and have seen no performance difference between RF2 and RF3. When the load on the servers gets higher, writing to three nodes instead of two will of course cause additional network traffic between nodes. The future will tell whether this has any negative effect.
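The 50% figure follows directly from storing three copies instead of two. A minimal sketch of the arithmetic, assuming a made-up 10 TB of VM data and ignoring compression and metadata overhead:

```python
def raw_needed(logical_tb: float, replication_factor: int) -> float:
    """Raw capacity consumed when every block is stored `replication_factor` times."""
    return logical_tb * replication_factor


logical_tb = 10.0  # hypothetical amount of VM data, not a figure from our cluster
rf2 = raw_needed(logical_tb, 2)
rf3 = raw_needed(logical_tb, 3)
extra_pct = (rf3 - rf2) / rf2 * 100

print(f"RF2: {rf2} TB raw, RF3: {rf3} TB raw, extra: {extra_pct:.0f}%")
```

Whatever the amount of data, the ratio stays at 3/2, so RF3 always costs half as much again as RF2.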

One of the first servers we cloned to Nutanix for testing was a SQL Server 2012 warehouse server with a home-made, stored-procedure-based ETL load process. This ETL load consists of many single-threaded SQL inserts and updates, so CPU clock speed becomes very important for a fast load. In this case we moved the VM from a three-year-old HP blade with 3.07 GHz CPUs to a 6060 node with only 2.8 GHz CPUs. We immediately saw a 10-15 % increase in load time because of this, but we also saw an additional 10-15 % increase in load time that we initially couldn’t understand. Nutanix spent several days helping us try to understand what was causing this performance drop. We started by implementing all best practices for SQL Server on VMware, then went on with even more changes recommended by Nutanix, none of them making any major difference.

A Performance Monitor setup where we added transaction log flushes/sec finally made me realize we were writing 100,000-200,000 small 512-byte blocks to disk per second during execution of some specific stored procedures. I could also see that the time for each write was far higher on Nutanix than in the old NetApp environment.

Finally we could pinpoint a cursor-based insert and update taking place in the stored procedure. Cursor-based inserts and updates are considered bad practice in the SQL community and should be avoided as much as possible. In this case, just adding a BEGIN TRANSACTION at the beginning and a COMMIT TRANSACTION at the end of the stored procedure made the transaction log flushes/sec go down to about 100/sec, with every write 64 KB in size. By doing so we could eliminate the additional 10-15 % increase in load time.
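The effect of wrapping the row-by-row work in one transaction can be illustrated outside SQL Server. The sketch below uses SQLite through Python’s `sqlite3` module as a stand-in, so it is an analogy rather than the actual stored procedure: the table name `fact` and the row count are made up, and an in-memory database does no real disk flushes. What it does show is how the number of commits, each of which a real database has to harden to its log, drops from one per row to one in total.

```python
import sqlite3
from typing import Tuple


def load_rows(n_rows: int, one_transaction: bool) -> Tuple[int, int]:
    """Insert n_rows one at a time; return (rows stored, commits issued)."""
    # isolation_level=None puts the sqlite3 module in autocommit mode, so
    # transactions are controlled only by the explicit BEGIN/COMMIT below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE fact (id INTEGER PRIMARY KEY, amount REAL)")
    commits = 0

    if one_transaction:
        conn.execute("BEGIN")

    for i in range(n_rows):
        conn.execute("INSERT INTO fact (id, amount) VALUES (?, ?)", (i, i * 1.5))
        if not one_transaction:
            commits += 1  # autocommit: every statement is its own transaction

    if one_transaction:
        conn.execute("COMMIT")
        commits += 1

    stored = conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
    conn.close()
    return stored, commits


print(load_rows(1000, one_transaction=False))  # (1000, 1000)
print(load_rows(1000, one_transaction=True))   # (1000, 1)
```

Both runs store the same rows. The difference is only in how often the database is asked to make the work durable, which is the same lever the BEGIN TRANSACTION / COMMIT TRANSACTION change pulled.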

But why did it run slower on Nutanix than on NetApp? At first I thought I had found something that would also make the load process faster on the original server, still running on NetApp, but to my surprise it did not decrease the load time at all in the NetApp environment.

The reason is probably that NetApp uses NVRAM to acknowledge writes from clients, whereas in Nutanix the acknowledgment takes place on SSD. When writes occur at such a high frequency as in the cursor-based update, this becomes a bottleneck. And since the cursor in this case updated billions of rows, it became a huge problem.
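A rough way to see why acknowledgment latency matters so much for a single-threaded, commit-per-row load: if every commit has to wait for its write to be acknowledged, total time is roughly the number of commits times the latency. The sketch below is a simplified model under that assumption. The latencies and row count are placeholders chosen for illustration, not measurements from NetApp or Nutanix.

```python
import math


def load_hours(rows: int, rows_per_commit: int, ack_latency_us: float) -> float:
    """Hours spent waiting on write acknowledgments for a serial load."""
    commits = math.ceil(rows / rows_per_commit)
    return commits * ack_latency_us / 1_000_000 / 3600


ROWS = 1_000_000_000  # placeholder for "billions of rows"

# Placeholder latencies in microseconds; NOT measured values.
for label, latency_us in (("lower-latency ack", 25.0), ("higher-latency ack", 100.0)):
    per_row = load_hours(ROWS, rows_per_commit=1, ack_latency_us=latency_us)
    batched = load_hours(ROWS, rows_per_commit=1000, ack_latency_us=latency_us)
    print(f"{label}: {per_row:.1f} h per-row commits, {batched:.3f} h batched")
```

In this model, any difference in acknowledgment latency is multiplied by the full row count when each row is its own commit, and becomes almost invisible once commits are batched. That is consistent with the explicit transaction helping a lot on Nutanix and not at all on NetApp, though the model leaves out everything else the storage layer does.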

I have already started part 4 and part 5 of the Nutanix journey, stay tuned for more info.

* More performance comparisons of moved VMs

* Migrating VMs to Nutanix might be slow… vMotion has limitations. Could we use a third-party tool like Veeam replication?

* What about backups? Are snapshots enough? What about SQL backups and single-database restores? Integration with NetBackup, and are there alternatives?

* Monitoring of Nutanix: how to monitor it in Nagios?






11 Responses to Nutanix journey – part 3

  1. Nick Robledo says:

    So glad I found this article. I’m testing a NX3060 block with ESXi 5.5.0 and am experiencing similar performance issues with my SQL ETL load testing. I’m comparing performance against a Dell PowerEdge 2950 and EqualLogic PS4100 (1gig iSCSI), and the results (load time) are the same on both systems. I was expecting a significant improvement on Nutanix since it is newer hardware and SSD disks. I’ve been performing disk IO tests (IOMeter) and noticed that write speeds on Nutanix are not much faster (if any) than they are to my EqualLogic, so I thought that was my issue.

    I’m going to get with my DB admin this week and have him review the cursor based insert and update section. I’m hoping making an adjustment there helps. Do you have any other suggestions other than the best practices guide for SQL (by Nutanix)?

  2. I am following this with great interest as I will be testing a fairly heavy SQL database application in a Nutanix cluster in the next couple of weeks. The workload is primarily OLAP which should perform fairly well with its heavy reads.

    I will post any interesting information I come across.

  3. gingerdazza says:

    Hi SQLSOS… I’d be very interested to find out if you ever resolved your SQL performance issues with Nutanix as we’re looking at Nutanix right now. Really would appreciate your comments.

    • sqlsos says:

      Hi, we are still working to get more and faster nodes in place. When that is done I will do a new performance test with the ETL job. In short, our environment was not baselined correctly and was too small for our working set. But we have over 180 VMs running in the cluster, many of them SQL servers, and it just works. I still believe Nutanix is a great product, but maybe not suited for everything.

      • gingerdazza says:

        Thanks. Useful info. We’re looking at 4x NX3060-G4 blocks to run 100 VMs, but most of these are generic VMs (AD, IIS, print servers), though some of them are going to be SQL. We did a cloud based pilot, setting up SQL VMs with multiple para-virtual drives on separate SCSI controllers and the tests we ran were good, but as you say it’s difficult to get a real world working set injected into a PoC.

      • gingerdazza says:

        …. do you have particularly large working sets?

      • sqlsos says:

        The 400 GB SSDs got full because of too many changes; the background processes couldn’t keep up with moving data away. The ETL warehouse load was a bit too much.

  4. gingerdazza says:

    Thanks for the info…. will we get parts 4 and 5 soon? Certainly interested to see them.

  5. gingerdazza says:

    sqlsos… in hindsight, would you place SQL on a non Nutanix platform or just get bigger Nutanix nodes? Which node configs did you go for when you resized? Did you have separate nodes just for SQL?

    • sqlsos says:

      Hi, I would still consider small to medium sized SQL VMs for Nutanix. They would probably work great on the 6600 nodes with 400 GB SSD we originally got. Our change rate was too high, so the SSDs got full all the time. We are now upgrading to bigger SSDs and additional 8150 nodes. We will probably pin the big SQL servers to the 8150 nodes. I will write about it when everything is up and running.
