
Stale Zip Processes Keeping log_space_watchdog From Reclaiming Space on the Boot Drive in /var/log

Problem

Stale zip processes that hold open file handles to deleted, uncompressed log files can cause the /var/log partition to fill up, because log_space_watchdog becomes unable to reclaim space on the boot drive. PURE-33421 is the relevant bug for this. The problem is most likely to occur at dark sites, or at sites whose phonehome problems cause the phonehome process to crash repeatedly.

Fundamentally, Purity's log space watchdog should delete old log files as the boot drive fills up.  The watchdog bases its judgment on the total size of the files in /var (which is generally dominated by /var/log).  However, stale zip processes hold file handles to log files that have already been deleted: those files still consume disk space, but they no longer contribute to the /var size computation.  This mismatch can prevent the log space watchdog from deleting log files even when boot drive utilization is high.
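To see the mismatch concretely, you can compare what the files under /var add up to against what the filesystem reports as used.  This is a minimal sketch, not a Purity tool, and it assumes GNU coreutils du/df (for -b and --output) are available on the controller:

    # Sum of visible file sizes under /var (roughly what the watchdog measures).
    du_bytes=$(du -sb /var 2>/dev/null | awk '{print $1}')

    # Bytes actually consumed on the /var filesystem (what df reports).
    df_bytes=$(df -B1 --output=used /var | tail -1 | tr -d ' ')

    echo "files in /var total: ${du_bytes} bytes"
    echo "filesystem reports:  ${df_bytes} bytes"
    # A large gap points at deleted-but-still-open files pinning the space.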

Impact

Uncorrected, the /var/log partition will fill up and generate the Boot drive utilization exceeds recommended space limit alert, which is probably how you noticed this problem in the first place.  Because stale zip processes hold the file handles of deleted logs, Purity's log watchdog does not delete logs to free up space.  If the workaround is not applied, the /var/log partition will eventually fill up and the array will stop logging, making it far more difficult to troubleshoot future issues.

If this symptom shows up on one controller, it typically manifests on the other controller as well.  So check both controllers!

Solution

To confirm this pathology:

  1. Look at the latest log_space_watchdog.log available:

     Jun 12 16 23:15:01 [LogSizeTrim:WARNING]  No trimmable space: desired_free: 5.65 GB min_logs: 34.14 GB trimmable: 0 Bytes. Check /var usage.
     Jun 12 16 23:15:01 [LogSizeTrim:WARNING]  {"/var/home": "140K", "/var/backups": "632K", "/var/lib": "84M", "/var/cache": "4.7M", "/var/log": "37G", "/var/spool": "948K"}
    

    Note that in this case, log_space_watchdog sees 37 GB of space used in /var/log, which dominates the size of /var.  This line may not exist if the boot drive is not very full; if so, go to step 2.
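    If you need to find these warnings across rotated (and possibly compressed) copies of the watchdog log, a plain zgrep for the warning text shown above works; adjust the path to wherever your log collection places the file:

     # Pull the most recent "No trimmable space" warnings from the watchdog logs.
     zgrep -h "No trimmable space" log_space_watchdog.log* | tail -5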

    If you look at the df -hl entry in the diagnostics log, you see something quite different:

    ------------------------------------------------------------------------
    Jun 12 23:17:07 df -hl
    ------------------------------------------------------------------------
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda3        11G  3.6G  6.2G  37% /
    udev            126G  4.0K  126G   1% /dev
    tmpfs            26G  1.8M   26G   1% /run
    none            5.0M   16K  5.0M   1% /run/lock
    none            126G   12K  126G   1% /run/shm
    none            100M     0  100M   0% /run/user
    /dev/sda2       1.9G   27M  1.7G   2% /boot
    /dev/sda7       3.3G  6.2M  3.1G   1% /cache
    /dev/sda1       487M  3.4M  483M   1% /boot/efi
    /dev/sda5       114G  103G  5.7G  95% /var

So we see that df reports 103 GB used for /var, while log_space_watchdog sees only 37 GB.  This discrepancy is your smoking gun for PURE-33421, and it is why log_space_watchdog isn't reclaiming space like it should.
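You can also confirm directly where the missing space went.  Because /var is its own mount point here, lsof /var covers the whole filesystem, and +L1 restricts the listing to files with a link count of zero (deleted but still held open).  Treat this as a sketch; column positions can vary slightly between lsof versions:

    # Sum the sizes of deleted-but-still-open files on /var.
    # With +L1, SIZE/OFF is typically the 7th column.
    lsof +L1 /var 2>/dev/null | awk 'NR>1 {sum += $7} END {printf "%.1f GB pinned by deleted files\n", sum/1024^3}'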

2. Look to see how many stale zip processes we have:

ryan.boudwin@i-5144828a:/support/217280/rpure08-ct0/2016_06_12$ zgrep "/usr/bin/zip -9 --junk-paths --encrypt --password 0smium76" diagnostics.log-2016061223.gz | head
root     14490  1.5  0.0   8772   952 ?        S    23:17   0:00  |                   \_ /usr/bin/zip -9 --junk-paths --encrypt --password 0smium76 - -@
root      3344  0.0  0.0   8772  1208 ?        S     2015   0:00 /usr/bin/zip -9 --junk-paths --encrypt --password 0smium76 - -@
root     48224  0.0  0.0   8772  1212 ?        S     2015   0:00 /usr/bin/zip -9 --junk-paths --encrypt --password 0smium76 - -@
root     15932  0.0  0.0   8772   952 ?        S     2015   0:00 /usr/bin/zip -9 --junk-paths --encrypt --password 0smium76 - -@
root     24596  0.0  0.0   8772  1212 ?        S     2015   0:00 /usr/bin/zip -9 --junk-paths --encrypt --password 0smium76 - -@
root     13400  0.0  0.0   8772  1208 ?        S     2015   0:00 /usr/bin/zip -9 --junk-paths --encrypt --password 0smium76 - -@
root      1066  0.0  0.0   8772  1208 ?        S     2015   0:00 /usr/bin/zip -9 --junk-paths --encrypt --password 0smium76 - -@
root     12534  0.0  0.0   8772   956 ?        S     2015   0:00 /usr/bin/zip -9 --junk-paths --encrypt --password 0smium76 - -@
root     30177  0.0  0.0   8772   956 ?        S     2015   0:00 /usr/bin/zip -9 --junk-paths --encrypt --password 0smium76 - -@
root     10435  0.0  0.0   8772  1212 ?        S     2015   0:00 /usr/bin/zip -9 --junk-paths --encrypt --password 0smium76 - -@
ryan.boudwin@i-5144828a:/support/217280/rpure08-ct0/2016_06_12$ zgrep "/usr/bin/zip -9 --junk-paths --encrypt --password 0smium76" diagnostics.log-2016061223.gz | wc -l
1099

In this case we have 1099 stale zip processes that we will need to clear to resolve this issue.

The more stale processes there are, the greater the (negative) impact of PURE-33421.  The impact is likely negligible if there are fewer than 10 stale processes, though there are exceptions.
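If you have live access to the controller rather than an archived diagnostics log, a rough equivalent count can be taken directly.  The zip command line below matches the example output above; adjust the pattern if it differs on your array:

    # Count running zip processes matching the stale pattern seen above.
    # The [/] trick keeps grep from matching its own process entry.
    ps aux | grep '[/]usr/bin/zip -9 --junk-paths' | wc -l

    # Count unique zip PIDs that are pinning deleted files.
    lsof 2>/dev/null | grep '^zip' | grep deleted | awk '{print $2}' | sort -u | wc -l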

3. Create a JIRA to verify your findings.

Workaround:

As of 6/14/2016 there is no permanent fix for this issue. As a workaround, kill the stale zip processes as follows:

for x in $(lsof | grep deleted | grep "^zip" | awk '{print $2}' | sort -u); do kill -9 $x; done

Do not run the above workaround without PSE approval. Killing processes is potentially dangerous and we need to be very cautious in these cases.
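Before requesting approval, it is safe to preview exactly which PIDs the workaround would kill; this is just the pipeline above without the kill:

    # List the unique PIDs of zip processes holding deleted files (read-only, safe to run).
    lsof | grep deleted | grep "^zip" | awk '{print $2}' | sort -u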

Example escalations:

ES-14037 @Workday - dark site

ES-16392 @Bloomberg - dark site