vCenter Server Appliance Crash due to Full /storage/seat Partition
One of my customers recently had one of their vCenter 6 Server Appliances go offline. The VM was still running and responding to pings, but the service wasn't working. I established an SSH session to the server and went through the basics, and what do you know, "df -h" revealed that the /storage/seat partition was 100% full.
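For anyone who wants to catch this before the appliance falls over, here is a minimal sketch of that "df -h" check as a script. The function names and the 90% default threshold are my own assumptions, not anything from VMware; it simply parses the POSIX-format output of df -P.

```shell
#!/bin/sh
# Print the Use% figure (digits only) for a mount point, reading
# `df -P` style output on stdin.
mount_usage_pct() {
    awk -v mp="$1" 'NR > 1 && $6 == mp { sub(/%/, "", $5); print $5 }'
}

# Warn when a mount point is at or above a threshold.
# The 90% default is an arbitrary assumption, not a VMware recommendation.
check_mount() {
    pct=$(df -P "$1" 2>/dev/null | mount_usage_pct "$1")
    [ -n "$pct" ] || { echo "$1: not a mount point on this system"; return 2; }
    if [ "$pct" -ge "${2:-90}" ]; then
        echo "$1 is ${pct}% full"
        return 1
    fi
    echo "$1 is ${pct}% full (ok)"
}

# On the appliance this would be called as:
#   check_mount /storage/seat
```

The non-zero return codes make it easy to hang an alert off the check from cron or a monitoring agent.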
Well, VMware has a fine KB Article about a full seat partition and how to solve it. At least, mostly how to solve it. The problem that I ran into is that the truncate commands (that free up space) were failing to run because there wasn't enough space on the partition. When I tried to execute them, I got the following message:
"ERROR: could not extend file ... No space left on device"
"Hint: Check free disk space."
I'll admit to chuckling when I saw the "hint" line. So, I had to free up some disk space so that I could free up some disk space. I did a bit of research into how to free up some space on there, but the general feeling I got was "don't delete any of those files!" and I wasn't feeling overly brave... especially since I had another option. Rather than deleting files to make free space, I could simply expand the VMDK to accomplish the same goal. And, VMware has another fine KB article about how to grow a VCSA partition!
Of course, in order to follow that procedure, I had to track down the VCSA VM so that I could modify its hardware. That's not always easy without vCenter... but we know how to find vCenter when it's down (please forgive the shameless plug). So, after I grew the appropriate VMDK and extended the partition with the vpxd_servicecfg storage lvm autogrow command, I was able to finish the procedure in the first KB article by executing the truncate table vpx_event cascade; and truncate table vpx_event_arg cascade; database commands.
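To keep the order of operations straight, the sequence above can be sketched as a dry-run checklist. Nothing in this script touches the appliance; it only prints the steps and the two SQL statements. The function names are mine, and the psql path and database name in the final comment are assumptions from memory that should be confirmed against the KB article before use.

```shell
#!/bin/sh
# Emit the two statements named in the text, one per line.
seat_cleanup_sql() {
    cat <<'EOF'
truncate table vpx_event cascade;
truncate table vpx_event_arg cascade;
EOF
}

# Order of operations described above, printed as a checklist.
# Nothing here is executed against the appliance.
print_recovery_plan() {
    echo "1. Grow the VMDK that backs /storage/seat (VM hardware settings)"
    echo "2. vpxd_servicecfg storage lvm autogrow"
    echo "3. Run in the vCenter database:"
    seat_cleanup_sql | sed 's/^/     /'
    echo "4. Restart the vCenter services"
}

print_recovery_plan

# Assumed invocation (path and database name NOT taken from this post;
# verify against the KB article):
#   seat_cleanup_sql | /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres
```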
Once those completed, I restarted my vCenter services and everything came back up. That left me with diagnosing the actual problem. From the process in that first KB article, I knew that one of my ESXi hosts was generating a ridiculous amount of SEAT data (over 35 million "vim.event.GeneralHostWarningEvent" events and another 35 million "com.vmware.vc.StatelessAlarmTriggeredEvent" events).
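The KB article does this tally with a database query. As a rough stand-in, the same "which event types dominate" question can be answered from any plain-text export with one event type per line. That input format, the file name events.txt, and the function name are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Read one event type per line on stdin and print "COUNT TYPE",
# most frequent first. The optional argument limits the number of rows.
top_event_types() {
    sort | uniq -c | sort -rn | head -n "${1:-10}" | awk '{ print $1, $2 }'
}

# Hypothetical usage, given an export with one event type per line:
#   top_event_types 5 < events.txt
```

A couple of event types sitting in the tens of millions, as in my case, stand out immediately in output like this.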
Poking around at that host revealed a *bunch* of entries in /var/log/vmkernel.log:
"ALERT: vmsyslog logger <syslog server> lost X log messages"
And it turns out that there's another VERY applicable VMware KB article about this situation. That KB has one of those good news/bad news lines. This excessive logging behavior is a known issue that is resolved in an update. So, our next step is simply to update these servers! In the meantime, in my SSH session on the appliance, I've used truncate -s 2G /storage/seat/DeleteMe to create a 2 GB junk file that I can easily delete in case this issue comes back before we are able to apply updates to these hosts.
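One caveat worth checking on that junk file: on most Linux filesystems, truncate -s creates a sparse file, which sets the file's length without necessarily allocating any blocks. If that is the case on the appliance, the placeholder would not actually hold space in reserve. The sketch below compares a truncate-created file with one written from /dev/zero, using 1 MiB files in a scratch directory; the helper names are hypothetical.

```shell
#!/bin/sh
# Apparent size in bytes versus space actually allocated (KiB).
apparent_bytes() { wc -c < "$1" | tr -d ' '; }
allocated_kib()  { du -k "$1" | awk '{ print $1 }'; }

# Reserve space by writing real zeros, so the blocks are allocated.
# Usage: reserve_space FILE SIZE_IN_MIB
reserve_space() {
    dd if=/dev/zero of="$1" bs=1048576 count="$2" 2>/dev/null
}

# Small demonstration in a scratch directory (1 MiB, not 2 GB).
demo_dir=$(mktemp -d)
truncate -s 1M "$demo_dir/sparse"
reserve_space "$demo_dir/real" 1
echo "sparse: $(apparent_bytes "$demo_dir/sparse") bytes, $(allocated_kib "$demo_dir/sparse") KiB allocated"
echo "real:   $(apparent_bytes "$demo_dir/real") bytes, $(allocated_kib "$demo_dir/real") KiB allocated"
rm -rf "$demo_dir"
```

If du reports far less than the file's length for the truncate-created file, something like reserve_space /storage/seat/DeleteMe 2048 would be the safer way to build the 2 GB placeholder.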
Hi Jason,
Thanks for documenting this. I had a similar issue, but when I tried to expand the VMDK file in the GUI, it failed with the error “invalid configuration for device 0”. I had to SSH to the hosts and use vmkfstools -X to expand the drive. Once it was expanded, the rest of the journey was uneventful.
Regards
Rahul Stephen