September 25, 2014

Alert "Space usage in root has exceeded 90%" on Data Domain Systems

Note that this alert is not the same as these alerts:

Space usage in Data Collection has exceeded 90%.
Space usage in /ddvar has exceeded 90%.

APPLIES TO

All Data Domain systems that are being monitored through the EMC Data Protection Advisor tool or through a custom script that logs into and out of the system excessively.
All Software Releases prior to 5.2.1.0.

PURPOSE

This article explains how to troubleshoot cases in which the customer receives the alert "Space usage in root has exceeded 90%" as a result of excessive logins and logouts by a custom script or the DPA tool.

CAUSE

Every successful login and logout on a DDR is logged in the /var/log/wtmp file. This file is rotated monthly. If there are a large number of logins and logouts over the course of a month, the file can grow large enough to push the root partition past 90% and trigger this alert. There is also a known bug that can cause root to fill (see "Separate potential cause" below).

SOLUTION

To determine if the login/wtmp file issue described herein is responsible for the alert:

Enter bash mode.

Check the space utilization in the root partition with the df -h command:

!!!! dd630-rtp1 YOUR DATA IS IN DANGER !!!! # df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/dd_dg00p15       2.0G  1.6G  377M  91% /
/dev/dd_dg00p14       5.0G  952M  3.8G  20% /ddr
/dev/dd_dg00p13        79G   29G   47G  38% /ddr/var
shm                   3.9G     0  3.9G   0% /dev/shm
/dev/dd_dg00p7         13G   13G     0 100% /ddr/col1/repl
localhost:/data       7.7T   21G  7.7T   1% /data
This alert triggers when the Use% on the partition Mounted on / exceeds 90%.
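
The threshold check above can be sketched as a small shell helper. This is only an illustration, not part of the DD OS tooling: the function name root_use_pct is hypothetical, and it assumes that df and awk are available in bash mode and that df accepts the POSIX -P option.

```shell
# Read df output on stdin and print the Use% (digits only)
# of the filesystem mounted on /.
root_use_pct() {
    awk '$NF == "/" { sub(/%/, "", $(NF-1)); print $(NF-1) }'
}

# -P keeps each filesystem on a single line so the fields line up.
pct=$(df -P / | root_use_pct)

# The alert fires when usage exceeds 90%.
if [ -n "$pct" ] && [ "$pct" -gt 90 ]; then
    echo "root is at ${pct}%: above the 90% alert threshold"
else
    echo "root is at ${pct:-unknown}%: not above the 90% alert threshold"
fi
```

Against the sample output above, the helper would print 91 for the / row and the check would report that root is above the threshold.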

Check the size of /var/log/wtmp and its rotations with the ls -lh /var/log/wtmp* command:

!!!! dd630-rtp1 YOUR DATA IS IN DANGER !!!! # ls -lh /var/log/wtmp*
-rw-rw-r--  1 root utmp 89M Aug  2 14:57 /var/log/wtmp
-rw-rw-r--  1 root utmp 92M Jul 31 21:06 /var/log/wtmp.1
If these files exceed 60M in size then they are almost surely the cause of the alert.
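
As an alternative to reading the ls output by eye, the 60M rule of thumb can be expressed with find. This is a sketch under assumptions: the function name large_wtmp_files is hypothetical, and it relies on a find that supports -maxdepth and the k size suffix (GNU and BSD find do).

```shell
# Print wtmp files directly under a directory that are larger
# than a given size in MiB (defaults: /var/log and 60 MiB).
large_wtmp_files() {
    dir=${1:-/var/log}
    limit_mib=${2:-60}
    # find compares sizes in 1 KiB blocks, so convert MiB to KiB.
    find "$dir" -maxdepth 1 -type f -name 'wtmp*' -size +"$((limit_mib * 1024))"k
}

large_wtmp_files /var/log 60
```

Any path printed is a wtmp file over the limit; no output means none of the files is that large.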

Investigate what is causing the log to fill up by dumping it:

!!!! dd630-rtp2 YOUR DATA IS IN DANGER !!!! # last -f /var/log/wtmp | less
sysadmin pts/0        d3-ubuntu.datado Wed Aug  1 18:23 - 19:37  (01:13)
sysadmin ssh          d3-ubuntu.datado Wed Aug  1 18:23 - 19:16  (00:53)
sysadmin pts/4        128.222.90.62    Wed Aug  1 14:42 - 15:34  (00:51)
sysadmin ssh          128.222.90.62    Wed Aug  1 14:42 - 14:43  (00:00)
(...)
The third column in this output shows the hostname or IP address of the host from which these logins are occurring. If a monitoring tool is responsible, a large number of lines will show the same host. Ask the customer whether DPA or their own custom monitoring script is running on that host.
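
Rather than paging through the dump, the logins can be tallied per source host. In last output the fields are user, terminal, and source host, so the host is field 3. The sketch below is illustrative only: the function name count_logins_by_host is hypothetical, and it assumes awk, sort, and uniq are available in bash mode. Entries without a remote host (for example reboot records or console logins) will show some other word in field 3 and can be ignored.

```shell
# Read `last` output on stdin and print the number of entries
# per source host, busiest host first.
count_logins_by_host() {
    # Skip blank lines and the trailing "wtmp begins ..." line.
    awk 'NF >= 3 && $1 != "wtmp" { print $3 }' | sort | uniq -c | sort -rn
}

# Only run against the live file when `last` and the file exist.
if command -v last >/dev/null 2>&1 && [ -r /var/log/wtmp ]; then
    last -f /var/log/wtmp | count_logins_by_host | head
fi
```

A host with a count far above the others is the likely source of the excessive logins.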

To apply the workaround:

Enter bash mode.

Truncate /var/log/wtmp and its rotations with the echo -n command:

  for i in /var/log/wtmp*; do echo -n > "$i"; done
Check the size of /var/log/wtmp and its rotations with the ls -lh /var/log/wtmp* command:

!!!! dd630-rtp1 YOUR DATA IS IN DANGER !!!! # ls -lh /var/log/wtmp*
-rw-rw-r--  1 root utmp 0 Aug  2 15:06 /var/log/wtmp
-rw-rw-r--  1 root utmp 0 Aug  2 15:06 /var/log/wtmp.1
The files should have a size equal to or very close to zero.

Ensure the space utilization in the root partition is less than 90% with the df -h command:

!!!! dd630-rtp1 YOUR DATA IS IN DANGER !!!! # df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/dd_dg00p15       2.0G  1.6G  377M  81% /
/dev/dd_dg00p14       5.0G  952M  3.8G  20% /ddr
/dev/dd_dg00p13        79G   29G   47G  38% /ddr/var
shm                   3.9G     0  3.9G   0% /dev/shm
/dev/dd_dg00p7         13G   13G     0 100% /ddr/col1/repl
localhost:/data       7.7T   21G  7.7T   1% /data

Check that the alert has cleared.

Set the rotation schedule of /var/log/wtmp to daily in /etc/logrotate.conf by changing the frequency on line 19 from monthly:

     17 # no packages own wtmp -- we'll rotate them here
     18 /var/log/wtmp {
     19     monthly
     20     create 0664 root utmp
     21     rotate 1
     22 }
To daily:

     17 # no packages own wtmp -- we'll rotate them here
     18 /var/log/wtmp {
     19     daily
     20     create 0664 root utmp
     21     rotate 1
     22 }
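
If the file is edited non-interactively, the change should be limited to the wtmp stanza so that a global monthly directive elsewhere in the file is left alone. The following is a sketch, not a supported procedure: the function name wtmp_rotate_daily and the /ddr/var/tmp paths in the comments are hypothetical, and it assumes sed is available in bash mode.

```shell
# Read a logrotate config on stdin and write it to stdout with
# "monthly" changed to "daily" only inside the /var/log/wtmp stanza.
wtmp_rotate_daily() {
    sed '/^\/var\/log\/wtmp[[:space:]]*{/,/^}/ s/^\([[:space:]]*\)monthly[[:space:]]*$/\1daily/'
}

# Suggested use (hypothetical paths): write to a new file, review
# the difference, and only then replace the live configuration.
#   wtmp_rotate_daily < /etc/logrotate.conf > /ddr/var/tmp/logrotate.conf.new
#   diff /etc/logrotate.conf /ddr/var/tmp/logrotate.conf.new
```

Writing to a separate file first keeps the original configuration intact until the diff confirms that only line 19 changed.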

Free additional space in root by applying the /var/upgrade/link_to_new_rpmfile.rpm change described in the document:

Moved link_to_new_rpmfile.rpm from root to ddr partition 181045

Separate potential cause: Bug 109239 causes a log file to be written to the /tmp directory instead of /ddvar. The system may have hit this bug if there is a large "/tmp/sub_kern.info.XXXXX" file in /tmp. The workaround is to move or delete that file. The bug is fixed in DD OS 5.4.3.0, 5.5.1.0, 5.6, and later releases. See the bug notes for more details.
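
To check for the file described above, the matching files in /tmp can be listed with their sizes. This is a minimal sketch: the function name list_sub_kern_files is hypothetical, and it assumes find (with -maxdepth), du, and sort are available in bash mode.

```shell
# List files named sub_kern.info.* directly under a directory
# (default /tmp), largest first, with sizes in KB.
list_sub_kern_files() {
    dir=${1:-/tmp}
    find "$dir" -maxdepth 1 -type f -name 'sub_kern.info.*' -exec du -k {} + 2>/dev/null | sort -rn
}

list_sub_kern_files /tmp
```

No output means no such file exists. A large entry at the top of the list is the candidate to move or delete.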
