ESXi Root Partition Full inode Table
One of my customers recently experienced a strange issue. One of their ESXi hosts had entered a problem state where Storage vMotion and vMotion were failing for all VMs on the host (vMotion was failing at 13%, which is an interesting spot). We initially noticed the issue when Storage vMotion repeatedly threw an error for one of their VMs:
A general system error occurred: Failed to create journal file provider: Failed to open "/var/log/vmware/journal/..." for write: There is no space left on the device.
Well, that error seemed self-explanatory. I connected to the host's CLI (SSH was already enabled, which made that easier) and ran "vdf -h" to look at its file systems. I was surprised to find that none of the partitions were full, so I dug deeper.
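For anyone following along, the command is just:

vdf -h

If memory serves, unlike plain df it also reports the host's ramdisks (including root) and tardisks alongside the regular partitions, which is why it's the go-to here.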
I decided to take a look at the vmkwarning log file, which is frequently a gold mine when troubleshooting ESXi host issues. So, I did a quick "tail /var/log/vmkwarning.log" and, lo and behold, we had many repeating errors that were slightly different from what vCenter showed me:
... Cannot create file /var/run/vmware/tickets... ...for process hostd-worker because the inode table of its ramdisk (root) is full.
So, it looked like the issue wasn't that the file system itself was full; instead, its inode table was. Some google-fu brought me to the command "localcli system visorfs ramdisk list", which showed that I had 8192 maximum inodes on my root partition... and 8192 allocated inodes... and 8192 used inodes. There was my confirmation of the issue: we were out of inodes. Now, I needed to find the source of the problem.
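For reference, here's that check on its own line; on newer builds the esxcli form should report the same table (I'd treat the exact column labels as version-dependent):

localcli system visorfs ramdisk list
esxcli system visorfs ramdisk list

You're looking for the root ramdisk row, where the maximum, allocated, and used inode counts all sat at 8192 in my case.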
So, I dug around to learn more about the inode tables and came across Michael Albert's post on a nearly identical issue! Michael used a nice, easy technique to figure out which folder had the excessive number of files (eating up all of the available inodes): find <folder> | wc -l
That command counts the files (and directories) under the specified directory. I ran find /var | wc -l and found that there were over 4200 files in my /var directory! If I understand things correctly, those files were consuming over half of my total 8192 inodes. So, I dug deeper, looking at the subfolders under /var to see if one of them had an absurd number of files in it. Eventually, I identified /var/run/sfcb as the culprit.
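In hindsight, a quick loop would do the per-folder counting in one shot; here's a rough sketch for the host's BusyBox shell (point it at whatever parent directory you suspect):

for d in /var/*/ ; do echo "$d $(find "$d" | wc -l)" ; done

Run it again against the busiest subdirectory (for example /var/run/*/) and the offender stands out quickly.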
What the heck is that folder? Well, some more research pointed me at this VMware KB article about SFCB exhausting all of the system's inodes. SFCB (the Small Footprint CIM Broker) is the service behind the host's hardware monitoring. It's important, but it's not a critical service... and that same KB article nicely describes a process for stopping the service and clearing out the excessive files.
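I won't reproduce the KB verbatim, but roughly speaking it boils down to stopping the CIM service, deleting the stale files, and starting it back up; this is a sketch from memory (the exact init script name and paths can vary by ESXi version, so follow the KB for your build):

/etc/init.d/sfcbd-watchdog stop
cd /var/run/sfcb
rm [0-2]*     # repeated for the KB's remaining filename ranges
/etc/init.d/sfcbd-watchdog start

Those chunked rm commands deserve a closer look, though, because they tripped me up.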
It's worth noting that the process described in the KB article has you delete all of the files in the sfcb folder, but it breaks the job into several deletion commands. For example, rm [0-2]* deletes all files whose names begin with 0, 1, or 2. "Why break it up like that?" I wondered. This was a production system, so I didn't want to deviate from the established procedure, and eventually the reason became evident. When I issued the rm [3-6]* command, I got back an error:
-sh: rm: Argument list too long
Apparently, I had a lot of files that started with 3 through 6. The shell can only pass so many arguments to a single command, so rm never even saw the expanded file list; the wildcards in the KB were there to keep each delete under that limit. I simply reduced the scope of my command to rm [3-4]* and made some progress... until I got to the 5s. We had a lot of files starting with 5, so I had to get even more granular. In my case, I ended up running rm 52[0-6]* and then rm [3-6]* to clear them out in two passes. After that, I completed the rest of the deletes as per the document.
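If you'd rather not play wildcard whack-a-mole, piping find into xargs (the same trick a commenter shares below) feeds rm the file names in batches that stay under the argument limit; a rough sketch, assuming you're already sitting in /var/run/sfcb:

find . -maxdepth 1 -type f -print0 | xargs -0 rm -f

On this production host I stayed close to the KB's procedure, but it's a handy one to keep in the back pocket.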
So, I went through that process. After completing it and restarting the management agents, all of my management services (including vMotion!) were working again. Out of curiosity, I checked the file count in /var/run/sfcb after the service had been running for about an hour: it only had 468 files in it. So, the original count of 4000+ files was certainly excessive!
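For completeness, restarting the management agents from the shell is just a pair of init scripts on the ESXi builds I've used (or services.sh restart to bounce everything, which is more disruptive):

/etc/init.d/hostd restart
/etc/init.d/vpxa restart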
Thanks for the pointer on finding the folder that contained a large number of files!
I had a similar issue with some errant process writing *.trp files to /var/spool/snmp on my ESXi 5.5 host. I ran into the "rm" command limitation as well while trying to delete the 4580 files, and used this command that my Linux guru colleague gave me to remove them all, which I ran from within the /var/spool/snmp folder:
find . -maxdepth 1 -name "*.trp" -print0 | xargs -0 rm -f
It passes the results from the find command to rm as arguments, in batches small enough to stay under the argument limit. I had to add the -f because "rm" complained about incorrect syntax without it.
I haven't tracked down which process is responsible yet; I'll likely just restart the management agents to see if that stops it.