September 25, 2014

Alert "Space in root exceeded 90%" on Data Domain Systems

Note that this alert is not the same as these alerts:

Space usage in Data Collection has exceeded 90%.
Space usage in /ddvar has exceeded 90%.

APPLIES TO

All Data Domain systems that are being monitored through the EMC Data Protection Advisor tool or through a custom script that logs into and out of the system excessively.
All Software Releases prior to 5.2.1.0.

PURPOSE

This article explains how to troubleshoot cases in which the customer receives the alert "Space usage in root has exceeded 90%" as a result of excessive logins and logouts by a custom script or the DPA tool.

CAUSE

Every successful login and logout on a DDR is logged in the /var/log/wtmp file. This file is rotated monthly; if there is a large number of logins and logouts over the course of a month, the file can grow very large and trigger this alert. There is also a known bug that can cause root to fill (see "Separate potential cause" below).

SOLUTION

To determine if the login/wtmp file issue described herein is responsible for the alert:

Enter bash mode.

Check the space utilization in the root partition with the df -h command:

!!!! dd630-rtp1 YOUR DATA IS IN DANGER !!!! # df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/dd_dg00p15       2.0G  1.6G  377M  91% /
/dev/dd_dg00p14       5.0G  952M  3.8G  20% /ddr
/dev/dd_dg00p13        79G   29G   47G  38% /ddr/var
shm                   3.9G     0  3.9G   0% /dev/shm
/dev/dd_dg00p7         13G   13G     0 100% /ddr/col1/repl
localhost:/data       7.7T   21G  7.7T   1% /data
This alert triggers when the Use% on the partition Mounted on / exceeds 90%.

Check the size of /var/log/wtmp and its rotations with the ls -lh /var/log/wtmp* command:

!!!! dd630-rtp1 YOUR DATA IS IN DANGER !!!! # ls -lh /var/log/wtmp*
-rw-rw-r--  1 root utmp 89M Aug  2 14:57 /var/log/wtmp
-rw-rw-r--  1 root utmp 92M Jul 31 21:06 /var/log/wtmp.1
If these files exceed 60 MB in size, they are almost certainly the cause of the alert.

Investigate what is causing the log to fill up by dumping it:

!!!! dd630-rtp2 YOUR DATA IS IN DANGER !!!! # last -f /var/log/wtmp | less
sysadmin pts/0        d3-ubuntu.datado Wed Aug  1 18:23 - 19:37  (01:13)
sysadmin ssh          d3-ubuntu.datado Wed Aug  1 18:23 - 19:16  (00:53)
sysadmin pts/4        128.222.90.62    Wed Aug  1 14:42 - 15:34  (00:51)
sysadmin ssh          128.222.90.62    Wed Aug  1 14:42 - 14:43  (00:00)
(...)
The third column in this output shows the hostname or IP address of the host from which these logins are occurring. A large number of lines should have the same hostname. Ask the customer whether DPA or their own custom monitoring script is running on that host.
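
To see at a glance which hosts account for most of the logins, the wtmp output can be summarized with standard shell tools from bash mode. This is only a sketch; adjust the field handling if your last output differs:

  # count logins per originating host, most frequent first
  last -f /var/log/wtmp | awk 'NF > 3 {print $3}' | sort | uniq -c | sort -rn | head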

To apply the workaround:

Enter bash mode.

Truncate /var/log/wtmp and its rotations with the echo -n command:

  for i in /var/log/wtmp*; do echo -n > "$i"; done
Check the size of /var/log/wtmp and its rotations with the ls -lh /var/log/wtmp* command:

!!!! dd630-rtp1 YOUR DATA IS IN DANGER !!!! # ls -lh /var/log/wtmp*
-rw-rw-r--  1 root utmp 0 Aug  2 15:06 /var/log/wtmp
-rw-rw-r--  1 root utmp 0 Aug  2 15:06 /var/log/wtmp.1
The files should have a size equal to or very close to zero.

Ensure the space utilization in the root partition is less than 90% with the df -h command:

!!!! dd630-rtp1 YOUR DATA IS IN DANGER !!!! # df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/dd_dg00p15       2.0G  1.6G  377M  81% /
/dev/dd_dg00p14       5.0G  952M  3.8G  20% /ddr
/dev/dd_dg00p13        79G   29G   47G  38% /ddr/var
shm                   3.9G     0  3.9G   0% /dev/shm
/dev/dd_dg00p7         13G   13G     0 100% /ddr/col1/repl
localhost:/data       7.7T   21G  7.7T   1% /data

Check that the alert has cleared.

Set the rotation schedule of /var/log/wtmp to daily in /etc/logrotate.conf by changing the frequency on line 19 from monthly:

     17 # no packages own wtmp -- we'll rotate them here
     18 /var/log/wtmp {
     19     monthly
     20     create 0664 root utmp
     21     rotate 1
     22 }
To daily:

     17 # no packages own wtmp -- we'll rotate them here
     18 /var/log/wtmp {
     19     daily
     20     create 0664 root utmp
     21     rotate 1
     22 }
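
To confirm the change, the wtmp stanza can be displayed again. This is a quick check, assuming the stanza lives in /etc/logrotate.conf as shown above:

  grep -A 4 "/var/log/wtmp" /etc/logrotate.conf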

Save some more space by applying the /var/upgrade/link_to_new_rpmfile.rpm change described in the document "Moved link_to_new_rpmfile.rpm from root to ddr partition" (181045).

Separate potential cause: Bug 109239 involves a log file being written to the /tmp directory instead of /ddvar. The system may have hit this bug if there is a large "/tmp/sub_kern.info.XXXXX" file in /tmp. The workaround is simply to move or delete that file. The bug is fixed in DD OS 5.4.3.0, 5.5.1.0, 5.6, and later. See the bug notes for more details.
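
If bug 109239 is suspected, a quick check and cleanup from bash mode might look like the following sketch (the exact file-name suffix varies, so the wildcard is an assumption):

  # look for the oversized temporary kernel-info log described above
  ls -lh /tmp/sub_kern.info.*
  # if present, remove it (or move it to /ddvar) to free space in /
  rm -f /tmp/sub_kern.info.*
  df -h /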

DataDomain Cleaning Phases

Cleaning (garbage collection) is an important process on a Data Domain system: because the filesystem does not overwrite data in place, cleaning is what reclaims the space occupied by deleted data. Unfortunately, this process can impact the performance of a system, and in unusual cases it can take more than 24 hours to complete. This document will help identify what is happening during a particular phase. It is possible to get an idea of how long previous cleaning sessions have taken by searching the messages log for "cleaning completed".
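
As a rough sketch, previous cleaning durations can be estimated from bash mode by searching the messages log for the completion message. The log path below is an assumption and may differ by DD OS release:

  # list cleaning completion messages along with their timestamps
  grep -i "cleaning completed" /ddr/var/log/messages*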

DD OS 4.2 and earlier:

In DD OS 4.2 and earlier, there are 6 phases. These steps are also present in 4.3.

Understanding them will help in understanding the cleaning phases in DD OS 4.3 and above.

Candidate - The candidate phase is run to select a subset of data to clean and remember what is in the data.
Enumerate - enumerate all the files in the logical space and remember what data is alive.
Merge - do an index merge to flush index data to disk.
Filter - if duplicate data has been written, find out where it is.
Copy - copy live data forward and free the space it occupied
Summary - create a summary of the live data that's on the system.

DD OS 4.3 through DD OS 5.4 (the phases are explained below):

Beginning in DD OS 4.3, the cleaning process (Full Cleaning) will take one of two paths, depending on the number of containers in use. This is due to a limit on the number of containers that can be cleaned on a single cleaning run.

Sampling is required for a filesystem that uses more containers than the limit. In that case, the cleaning process performs focused cleaning on the subset of containers with the most reclaimable space, and all of the cleaning phases listed below are run, including phases 5-8. Note that phases 6-9 restrict their working set to the candidate containers obtained in phase 5.

Different DDR models have different amounts of memory, so the amount of physical space that can be cleaned in a single cleaning run depends on the model. On systems that are fairly empty, with the number of containers in use below 25-30% of the total container set, all of the physical space can be cleaned in a single cleaning run. Cleaning completes much more quickly on these systems because it skips directly from phase 4 to phase 9, the copy phase, eliminating phases 5-8. Note that the skipped phases are displayed as 100% complete.

Pre-enumerate - enumerate all the files in the logical space. It may only sample part of the data to help with estimating where live data is located in physical space.
Pre-merge - do an index merge to flush index data to disk.
Pre-filter - if duplicate data has been written, find out where it is.
Pre-select - select the physical space that has the most dead data. This is what we want to clean.
At this point the cleaning process will follow one of the two paths described above, depending on the number of containers in the filesystem.


  • Candidate - due to memory limitations, only a fraction of physical space can be cleaned in each cleaning run. The candidate phase is run to select a subset of data to clean and remember what is in the data.
  • Enumerate - enumerate all the files in the logical space and remember what data is active.
  • Merge - do an index merge to flush index data to disk.
  • Filter - determine what duplicate data has been written and find out where it is.
  • Copy - copy live data forward and free the space it used to occupy
  • Summary - create a summary of the live data that's on the system.

Other information about GC/cleaning:

Note that the phase numbers may differ depending on the DD OS version on your system. For Full Cleaning, phase 1 (pre-enumeration) and phase 6 (enumeration) can take a long time when the following conditions are present:

  • Poor Lp locality: Cleaning will be slowed if there is significant fragmentation across containers.
  • Very high global compression: If two DDRs consume the same amount of physical space (i.e., number of containers) and one has 50x global compression while the other has 100x, enumeration takes longer on the second DDR because there is a much larger logical space to traverse.
  • Many small files
  • Replication is lagging behind.
  • For Full Cleaning, the runtime of phase 1 and phase 6 depends on the logical size of the filesystem (i.e. Logical Bytes).
  • The runtime of other phases depends on the physical size of the filesystem (i.e. # of containers in use).
  • Performance bottleneck: before DD OS 4.5, the index merge could be a performance bottleneck; this has been fixed in DD OS 4.5 and beyond.
  • Pre-enumeration/enumeration/copy phases are the most time-consuming phases in GC/cleaning.
  • Copy phase (phase 9 in Full Cleaning or Phase 11 in Physical Cleaning) can take a long time for the following cases:
  • High live percentage of containers selected for copy forward: Not enough physical data deleted before running GC/cleaning. It is possible that GC/cleaning is being run more often than needed.
  • Additional processing: Re-encryption, recompression, features, sketching
  • This feature exists in DD OS 5.0 and beyond. Systems upgraded from pre-5.0 to 5.0 or later will experience slowness in the first round of GC/cleaning, since features are computed for each container.
  • Enabling delta replication requires sketch, which will require an extra cycle in GC/cleaning to recompute sketch during the copy phase.
  • GZ local compression is significantly more expensive than LZ.
  • The performance cost of encryption and key-rotation (5.2) is significant.
  • Note that the percentage complete in the enumeration phase (phase 6 in Full Cleaning or phase 9 in Physical Cleaning) can actually drop if new files are added to the system while cleaning is in progress, or if a new fastcopy or checkpoint is created.
  • Copy phase (phase 9 in Full Cleaning or phase 11 in Physical Cleaning) generally takes the longest as this is where deleting/copying takes place. This is where the results of the cleaning can be observed with the "df" command.

Client side scripting on DataDomain

The basic technique is to use ssh private/public key pairs to allow login and remote execution of commands without performing an interactive login. A private key is generated on the client, and the corresponding public key is stored on the Data Domain Restorer (DDR) using the adminaccess set ssh-keys command. After adding this public key to the DDR, subsequent logins are authenticated using the private key presented by the client. If the private key presented by the client matches the public key stored on the DDR, the login is permitted (without requiring a password).

The basis for this mechanism is ssh. There are a number of ssh implementations available for many different platforms, including OpenSSH, PuTTY, etc. This document was written with OpenSSH running on a Linux client. A similar setup is possible on a Windows client using OpenSSH under Cygwin, or using PuTTY.

Note: Data Domain devices can ONLY handle 10 concurrent ssh client sessions at once. Take this into consideration when scheduling your scripted tasks; a simple way to keep scripted tasks sequential is sketched below.
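
The following is a minimal sketch of one way to stay under that limit: run monitoring commands one after another over sequential ssh invocations rather than in parallel. The host name is an example from this document, and the command list is illustrative:

  #!/bin/sh
  # run each monitoring command in turn so only one session is open at a time
  DDR=sysadmin@dd410-1.support.sample.local
  for cmd in "filesys show space" "replication status all" "system show performance"; do
      ssh "$DDR" $cmd
  done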

Creating a public/private key pair

The first step to enabling “passwordless” login is to create a private/public key pair. Using OpenSSH, the process is as shown in figure 1. The user is “fred” on a Linux host named “cotton”. This example assumes that Fred’s .ssh directory starts completely empty (please move or rename any existing .ssh directory before following this example).

Figure 1: Generating a private/public key (user input in red)

fred@cotton ~ $ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/auto/home/fred/.ssh/id_dsa):
Created directory '/auto/home/fred/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /auto/home/fred/.ssh/id_dsa.
Your public key has been saved in /auto/home/fred/.ssh/id_dsa.pub.
The key fingerprint is:

df:16:ed:f2:a4:5a:a0:b8:bb:27:99:51:9b:cb:ee:48 fred@cotton.sample.com

fred@cotton ~ $ ls .ssh
id_dsa id_dsa.pub
Note that this example uses an empty passphrase. The return key was pressed at the “Enter passphrase (empty for no passphrase):” prompt (as shown in figure 1). Please note that this means the private key will be stored unencrypted on the local filesystem (in the file id_dsa). After completing the following steps, anyone with read access to /auto/home/fred/.ssh/id_dsa will be able to run any command on the DDR without providing a password.

If the security of the local filesystem cannot be guaranteed, we recommend using a passphrase to encrypt the private key. In order to allow passwordless login, however, a passphrase encrypted private key will require the use of ssh-agent and ssh-add. These commands allow you to enter the passphrase once interactively, with subsequent commands using process memory to store the unencrypted password (hopefully more secure than the filesystem). The usage of these commands is beyond the scope of this document, however. The remainder of this document assumes the private key was stored without a passphrase.
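
For reference, a minimal sketch of the ssh-agent/ssh-add flow looks like this (the file names and DDR host are the examples used in this document):

  # start an agent and load the passphrase-protected private key once
  eval "$(ssh-agent -s)"
  ssh-add ~/.ssh/id_dsa          # prompts for the passphrase a single time
  # later commands reuse the key held in the agent, with no prompt
  ssh sysadmin@dd410.sample.com system show version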

Adding the public key to the DDR

Once the key pair has been created, the next step is to store the public key on the DDR. Note that the private key must be kept secret — anyone possessing this key will be able to run any command they wish on the DDR without providing a password. The public key, however, may be distributed freely (the public key is stored in id_dsa.pub). It is not possible to determine a private key from a public key. It is possible, however, to tell if a given private key matches a given public key.

The easiest way to enter the public key on the DDR is to use “cut and paste” with a mouse. First display the text using the following command:

Figure 2: Display the public key

fred@cotton ~ $ cat .ssh/id_dsa.pub
ssh-dss
AAAAB3NzaC1kc3MAAAEBALi3J6zJW57YWwgyQ6QXF9hq3MNOcKuR+PCk5q0qlE0zgqXtjbNAnJWP7gs7v9pXpBifEoA
0mUC7K2GZQKUMByU90r7j4oQJg/+2LG8AKhopkFHsOR+8cHDU0oppnwyQFoa1HOwrK7m9Uetsb8jWzIZ8D+tvcDcf0Y
zKY0ExI65VCxuthvLdqEZavM1btj/DQtUrUwB/OxbeV77Sl8HM2leQ4iw2VbihfxCfp6VrS5wIpNqfZdAZEhfE5AM2a
vS9/WdRY+ROj7Sb/5JyN2zK/A9MzlOnVQgTU58vyQm4XpLVJKyF7sbPK/Z6tLtbYhobVdVVeLb3IZbRjNbbN/X9CBUA
AAAVAKxCSxSTsrHaxoEeP5ff1QeKhPpPAAABAGKsikkkJA19M7pEByx/G0URB+ukQbnaF3fTuJ71arr0cHMpwqPkbko
2Kibyf7wP4TekPMaFz7vYCm2696lQn8XsUJvaYRNQ1ghEVCXfHm3IKebEPM7437OoKQUSzyKp9hETj8BcMX+RWotYa4
1OC/NiZ708POQxsVlhctdvFx+PxSeAcS1/JVvzLnlWU6U8W7Cyi83J4J+EHfICP0e2JwUfdbJXhrls7sk7QJMvkx7T/
5U8i6kvnPyrxl5J2PKvM5D0W7anEdHP/M/HkZ4kYoj++O8c4Ptautcf60qXDoxLmkhLUoKpZm+G9vaEI4B8G5JWYLFI
RQj8KZPG27O92vAAAAEBAI1I8uFOt71j6/+QHzru9tpY67NAMMlOMU4MiOUwpgWMkOaGH2ciXhNWGv1513BsH6mW8jY
p5aTMSwgyV0abPDM1lYrSSwhrd8bPeCU53m46UeyD1opaMPayKDmEAAaXxBydP2QU8t+WVizK+RzSUCSazpTltIMW5N
uatdp9Kr92A8Iq0E1x99Mq0nhXavskMxgJ6PX1g0SVLarBARDuVtbz7UsHcTZOGns2wLl/OivVhanfCIBv5ZKfRMqUa
OlADVtI26QYllNs5+XEqDvPod8964QZEVbkITIT7Jxj2VOaaR/HQGYA29J8EBD1iY1HBV+PIdgtoEk9/TCwxa2asW0=
fred@sample.com
Select and copy all of the lines of text using your mouse (the text beginning “ssh-dss” and ending “...W0=fred@sample.com”).

Storing the public key on the DDR

Next you will login to the DDR and store this public key. You must, of course, login to the DDR in order to enter these commands. Because the public key has not yet been added, you will need to provide the password for this login. This example will use the DDR named dd410.sample.com and the sysadmin user.

Figure 3: Add the public key to the DDR

fred@cotton ~ $ ssh sysadmin@dd410.sample.com
The authenticity of host 'dd410.sample.com (192.168.4.135)' can't be established.
RSA key fingerprint is 08:1d:ff:cf:2b:00:53:c8:10:e3:01:82:af:df:9f:2e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'dd410.sample.com,192.168.4.135' (RSA) to the list of known hosts.
Password:
Last login: Sat Feb 17 16:18:50 2007 from cotton.sample.com

Welcome to Data Domain OS 4.0.5.0-34686
---------------------------------------
sysadmin@dd410-1# adminaccess show ssh-keys
sysadmin@dd410-1# adminaccess add ssh-keys

Enter the key and then press Control-D, or press Control-C to cancel.

ssh-dss
AAAAB3NzaC1kc3MAAAEBALi3J6zJW57YWwgyQ6QXF9hq3MNOcKuR+PCk5q0qlE0zgqXtjbNAnJWP7gs7v9pXpBifEoA
0mUC7K2GZQKUMByU90r7j4oQJg/+2LG8AKhopkFHsOR+8cHDU0oppnwyQFoa1HOwrK7m9Uetsb8jWzIZ8D+tvcDcf0Y
zKY0ExI65VCxuthvLdqEZavM1btj/DQtUrUwB/OxbeV77Sl8HM2leQ4iw2VbihfxCfp6VrS5wIpNqfZdAZEhfE5AM2a
vS9/WdRY+ROj7Sb/5JyN2zK/A9MzlOnVQgTU58vyQm4XpLVJKyF7sbPK/Z6tLtbYhobVdVVeLb3IZbRjNbbN/X9CBUA
AAAVAKxCSxSTsrHaxoEeP5ff1QeKhPpPAAABAGKsikkkJA19M7pEByx/G0URB+ukQbnaF3fTuJ71arr0cHMpwqPkbko
2Kibyf7wP4TekPMaFz7vYCm2696lQn8XsUJvaYRNQ1ghEVCXfHm3IKebEPM7437OoKQUSzyKp9hETj8BcMX+RWotYa4
1OC/NiZ708POQxsVlhctdvFx+PxSeAcS1/JVvzLnlWU6U8W7Cyi83J4J+EHfICP0e2JwUfdbJXhrls7sk7QJMvkx7T/
5U8i6kvnPyrxl5J2PKvM5D0W7anEdHP/M/HkZ4kYoj++O8c4Ptautcf60qXDoxLmkhLUoKpZm+G9vaEI4B8G5JWYLFI
RQj8KZPG27O92vAAAAEBAI1I8uFOt71j6/+QHzru9tpY67NAMMlOMU4MiOUwpgWMkOaGH2ciXhNWGv1513BsH6mW8jY
p5aTMSwgyV0abPDM1lYrSSwhrd8bPeCU53m46UeyD1opaMPayKDmEAAaXxBydP2QU8t+WVizK+RzSUCSazpTltIMW5N
uatdp9Kr92A8Iq0E1x99Mq0nhXavskMxgJ6PX1g0SVLarBARDuVtbz7UsHcTZOGns2wLl/OivVhanfCIBv5ZKfRMqUa
OlADVtI26QYllNs5+XEqDvPod8964QZEVbkITIT7Jxj2VOaaR/HQGYA29J8EBD1iY1HBV+PIdgtoEk9/TCwxa2asW0=
fred@sample.com
^D

Trying to add this much text without some form of “cut and paste” is far too difficult and error prone, but with sufficient effort, manual entry is possible (it’s also possible to generate a shorter key using the -b 512 option when generating the keys with the ssh-keygen command).

Figure 4: Verify only one key is stored and quit

sysadmin@dd410-1# adminaccess show ssh-keys
1         ssh-dss
AAAAB3NzaC1kc3MAAAEBALi3J6zJW57YWwgyQ6QXF9hq3MNOcKuR+PCk5q0qlE0zgqXtjbNAnJWP7gs7v9pXpBifEoA
0mUC7K2GZQKUMByU90r7j4oQJg/+2LG8AKhopkFHsOR+8cHDU0oppnwyQFoa1HOwrK7m9Uetsb8jWzIZ8D+tvcDcf0Y
zKY0ExI65VCxuthvLdqEZavM1btj/DQtUrUwB/OxbeV77Sl8HM2leQ4iw2VbihfxCfp6VrS5wIpNqfZdAZEhfE5AM2a
vS9/WdRY+ROj7Sb/5JyN2zK/A9MzlOnVQgTU58vyQm4XpLVJKyF7sbPK/Z6tLtbYhobVdVVeLb3IZbRjNbbN/X9CBUA
AAAVAKxCSxSTsrHaxoEeP5ff1QeKhPpPAAABAGKsikkkJA19M7pEByx/G0URB+ukQbnaF3fTuJ71arr0cHMpwqPkbko
2Kibyf7wP4TekPMaFz7vYCm2696lQn8XsUJvaYRNQ1ghEVCXfHm3IKebEPM7437OoKQUSzyKp9hETj8BcMX+RWotYa4
1OC/NiZ708POQxsVlhctdvFx+PxSeAcS1/JVvzLnlWU6U8W7Cyi83J4J+EHfICP0e2JwUfdbJXhrls7sk7QJMvkx7T/
5U8i6kvnPyrxl5J2PKvM5D0W7anEdHP/M/HkZ4kYoj++O8c4Ptautcf60qXDoxLmkhLUoKpZm+G9vaEI4B8G5JWYLFI
RQj8KZPG27O92vAAAAEBAI1I8uFOt71j6/+QHzru9tpY67NAMMlOMU4MiOUwpgWMkOaGH2ciXhNWGv1513BsH6mW8jY
p5aTMSwgyV0abPDM1lYrSSwhrd8bPeCU53m46UeyD1opaMPayKDmEAAaXxBydP2QU8t+WVizK+RzSUCSazpTltIMW5N
uatdp9Kr92A8Iq0E1x99Mq0nhXavskMxgJ6PX1g0SVLarBARDuVtbz7UsHcTZOGns2wLl/OivVhanfCIBv5ZKfRMqUa
OlADVtI26QYllNs5+XEqDvPod8964QZEVbkITIT7Jxj2VOaaR/HQGYA29J8EBD1iY1HBV+PIdgtoEk9/TCwxa2asW0=
fred@sample.com

sysadmin@dd410-1#quit
Connection to dd410-1.support.sample.local closed.
After adding the key, verify that the paste was performed correctly and that there are no extraneous characters or line breaks. The end result should be a single line returned by adminaccess show ssh-keys. If, instead of output similar to figure 4, you see multiple lines, use the adminaccess del ssh-keys 1 command multiple times until there are no more lines remaining, then repeat the process (using a different form of “cut and paste” or manually typing in the key if necessary).

Running commands remotely without a password

Now, if everything has gone correctly it should be possible to login to the DDR from the client without providing a password:

Figure 5: Verify that ssh from the client no longer requires a password

fred@cotton ~ $ ssh sysadmin@dd410-1.support.sample.local
Last login: Mon Feb 19 09:35:59 2007 from cotton.sample.com

Welcome to Data Domain OS 4.0.5.0-34686
---------------------------------------
sysadmin@dd410-1#quit
Connection to dd410-1.support.sample.local closed.


Note that in addition to starting an interactive ddsh login without providing a password, it is also possible to run individual, non-interactive DD OS CLI commands by adding the command to the ssh command line (see figure 6).

Figure 6: Running non-interactive commands

fred@cotton ~ $ ssh sysadmin@dd410-1.support.sample.local system show uptime
09:37:41 up 10 days, 2:19, 2 users, load average: 0.01, 0.03, 0.04
Filesystem has been up 0 days, 23:21.

fred@cotton ~ $ ssh sysadmin@dd410-1.support.sample.local system show version
Data Domain OS 4.0.5.0-34686

fred@cotton ~ $ ssh sysadmin@dd410-1.support.sample.local autosupport show report

========== GENERAL INFO ==========
GENERATED_ON=Mon Feb 19 09:45:34 PST 2007
VERSION=Data Domain OS 4.0.5.0-34686
[remaining text deleted]

Particularly useful commands for monitoring the status of a DDR include, but are obviously not limited to:

system show performance
replication status all
replication show stats all
filesys show space
filesys show compression last 24 hours

Sample Script

This technique can be used to write arbitrary scripts. An example of such a script is shown in figure 7. This script is intended to log the status of a replication context at ten minute intervals.

 Figure 7: Replication status logger



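The original listing for figure 7 is not reproduced here. The following is a minimal shell sketch of the same idea; the host name and the patterns used to pull fields out of the command output are illustrative assumptions, since the exact output layout varies by DD OS release:

  #!/bin/sh
  # repl_status_logger.sh -- log replication status at ten-minute intervals
  DDR=sysadmin@dd410-1.support.sample.local
  INTERVAL=600   # seconds

  while true; do
      stats=$(ssh "$DDR" replication show stats)
      status=$(ssh "$DDR" replication status)
      # pull out the fields of interest; adjust the patterns to match your output
      dest=$(echo "$stats"   | awk 'tolower($0) ~ /dest.*remaining/       {print $NF; exit}')
      src=$(echo "$stats"    | awk 'tolower($0) ~ /source.*remaining/     {print $NF; exit}')
      comp=$(echo "$stats"   | awk 'tolower($0) ~ /compressed.*remaining/ {print $NF; exit}')
      state=$(echo "$status" | awk 'tolower($0) ~ /state/                 {print $NF; exit}')
      conn=$(echo "$status"  | awk 'tolower($0) ~ /connection/ {sub(/^[^:]*: */, ""); print; exit}')
      echo "$(date) $dest $src $comp $state $conn"
      sleep "$INTERVAL"
  done
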
This script emits a single line every ten minutes containing a timestamp; the “destination remaining”, “source remaining”, and “compressed remaining” fields from replication show stats; and the “state” and “connection” fields from replication status (refer to the Restorer User’s Guide for explanations of these fields). Sample output is shown in figure 8. A non-zero value for any of the “remaining” fields indicates replication work remains to be done (the two DDRs aren’t in synch), while any value for “state” other than “normal” indicates a problem of some sort (a connection disruption, for example).

Figure 8: Sample output from script

fred@cotton ~ $ ./repl_status_logger.pl
Mon Feb 19 16:19:01 2007 0 0 0 normal connected since Sun Feb 18 11:18:26
Mon Feb 19 16:29:01 2007 0 0 0 normal connected since Sun Feb 18 11:18:26
Mon Feb 19 16:39:01 2007 0 0 0 normal connected since Sun Feb 18 11:18:26
Mon Feb 19 16:49:01 2007 0 0 0 normal connected since Sun Feb 18 11:18:26
^C

As written, this script is not terribly robust. A number of improvements should be considered before using this script in production:

No error checking is performed. If the ssh command fails for any reason (another administrator removing the key, for example) the problem will almost certainly not be caught. If the DDR being monitored cannot be reached, a meaningful log entry should be generated.

The program runs in an endless loop, sleeping ten minutes between iterations. If the process were to be killed for any reason, however, the script would not be restarted. This script is a “daemon” and should be run under some form of daemon control that restarts the process if it dies for any reason (while simultaneously ensuring that there is never more than one process running at a time). The daemon should also be restarted at boot time.

The output is written directly to stdout. To make this more useful, the output should be redirected to a file, syslog, or some other logging system. There may be other fields of interest from the replication show stats or replication status commands. If desired, other statistics could also be monitored and logged (network interface statistics, system load, etc.).

The configuration for this script is contained in hard-coded variables at the top of the script. To make the script more generally applicable, it might be worth passing in the configuration information via command-line arguments, a config file, or some other mechanism. Some form of error checking should also be performed on these configuration arguments: the INTERVAL variable, for example, should not normally be less than about 5 or 10 seconds to prevent undue load on the DDR being monitored (just the act of logging into the system via ssh or otherwise causes work unrelated to backup and restore).

NFS Best Practices for DataDomain

This document provides a set of recommended best practices for deploying Data Domain storage systems for data protection and archive with the NFS (Network File System) protocol. It offers insight into how to tune NFS network components in order to optimize NFS services on a Data Domain system, and covers best practices for configuring the NFS server, application clients, network connectivity, and security.

Introduction:

The Network File System (NFS) protocol, originally developed by Sun Microsystems in 1984, is a client-server protocol that allows a user on a client computer to access files over a network as easily as if they were on its local disks. NFS allows UNIX systems to share files, directories, and file systems. The protocol is designed to be stateless.

Data Domain supports NFS version 3 (NFSv3), the most commonly used version. It uses UDP or TCP and is based on a stateless protocol design. It includes features such as 64-bit file sizes, asynchronous writes, and additional file attributes to reduce re-fetching.

Configuration:

DataDomain:

Make sure the NFS service is running on the DD:  #nfs status
It is best practice to use the hostname of the client when creating the NFS export.

Please make sure all the forward and reverse lookups of the hostnames in the nfs export list are correct in the DNS server.  #net lookup

To add more than one client use a comma or space or both.
A client can be a fully-qualified domain hostname, class-C IP addresses, IP addresses with either netmasks or length, or an asterisk (*) wildcard with a domain name, such as *.yourcompany.com. An asterisk (*) by itself means no restrictions.

It is a best practice to keep the client entries in the NFS export list to a manageable number. This can be achieved by using ACL entries such as *.company.com or a network address.

A client added to a sub-directory under /backup has access only to that sub-directory.
The export options are a comma-separated and/or space-separated list bounded by parentheses. With no options specified, the default options are rw, root_squash, no_all_squash, and secure.

The following options are allowed (see the NFS Options table):

The following is an example of adding an NFS export client:

#nfs add /backup 192.168.29.30/24 (rw,no_root_squash,no_all_squash,secure)
DD doesn’t support NFS locking.   Please look at the NFS locking KB for detailed information.
It is always a best practice to use multiple NFS mounts on the client with multiple IPs on Data Domain for better performance.

For example, TSM data is backed up to the /backup/TSM directory and SQL to /backup/SQL.
Create separate NFS exports for /backup/TSM and /backup/SQL:
#nfs add /backup/TSM hostname.company.com (rw,no_root_squash,no_all_squash,secure)
#nfs add /backup/SQL hostname.company.com (rw,no_root_squash,no_all_squash,secure)

On the client, mount them as two different mount points using two different IPs of the DD (DD-IP-1 and DD-IP-2 below are placeholders):
#mount -F nfs -o hard,intr,llock,vers=3,proto=tcp,timeo=1200,rsize=1048600,wsize=1048600 DD-IP-1:/backup/TSM /ddr/TSM
#mount -F nfs -o hard,intr,llock,vers=3,proto=tcp,timeo=1200,rsize=1048600,wsize=1048600 DD-IP-2:/backup/SQL /ddr/SQL

Security

Port 2049 (NFS) and port 2052 (mountd) must be open on the firewall, if one exists.
For better security, it is a best practice to use the “root_squash” NFS export option when configuring the export.
Along with the root_squash option, set anongid and anonuid to a specific ID on the DD.
For example:
#nfs add /backup/TSM hostname.company.com (rw,root_squash,no_all_squash,secure,anongid=655,anonuid=655)

Configuration

Please make sure the nfsd daemon is running on the client.
   #/sbin/service nfs status
It is a best practice to create a separate MTree for each media/database server. This helps reduce the number of metadata operations (see the example below).
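
For illustration only, separate MTrees for two media servers could be created on the DDR like this (the names are examples, not required values):

  #mtree create /data/col1/tsm-server01
  #mtree create /data/col1/sql-server01
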
Check for stable writes. Run tcpdump while a backup is performed and check for the stable flag in the WRITE packets (a capture example follows this item). If clients send small stable writes (less than 256 KB), performance will degrade; this is due to pipeline-commit behavior in 5.1, where stable small writes incur long write latency. This issue is fixed in 5.1.1.3.
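
A capture such as the following can be taken on the client while a backup runs and then inspected for the stable flag in the NFS WRITE calls. The interface name, output file, and host are placeholders:

  #tcpdump -i eth0 -s 0 -w /tmp/nfs_writes.pcap host DDR_HOSTNAME and port 2049
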
Mount options we recommend using:

Linux OS:

# mount -t nfs -o hard,intr,nolock,nfsvers=3,tcp,timeo=1200,rsize=1048600,wsize=1048600,bg HOSTNAME:/backup /ddr/backup
# mount -t nfs -o hard,intr,nolock,nfsvers=3,tcp,timeo=1200,rsize=1048600,wsize=1048600,bg HOSTNAME:/data/col1/ /ddr/

Solaris OS:

#mount -F nfs -o hard,intr,llock,vers=3,proto=tcp,timeo=1200,sec=sys,rsize=1048600,wsize=1048600 HOSTNAME:/backup /ddr/backup

# mount -F nfs -o hard,intr,llock,vers=3,proto=tcp,timeo=1200,sec=sys,rsize=1048600,wsize=1048600 HOSTNAME:/data/col1/ /ddr/

AIX OS:

# mount -V nfs -o intr,hard,llock,timeo=1200,rsize=65536,wsize=65536,vers=3,proto=tcp,combehind,retrans=2 HOSTNAME:/backup /ddr
# mount -V nfs -o intr,hard,llock,timeo=1200,rsize=65536,wsize=65536,vers=3,proto=tcp,combehind,retrans=2 HOSTNAME:/data/col1/ /ddr/

Network

If the client has multiple network interfaces, please make sure the routing is correct and that the NFS I/O operations use the intended interface (see the routing check below).
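
On a Linux client, the interface that traffic to the DDR will use can be confirmed from the routing table; the address below is a placeholder:

  #ip route get DDR_IP_ADDRESS
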
Set the TCP buffers to the maximum value the client OS can support. For more detailed information, please see the tuning guide for each client OS below.
Using ping, check the RTT between the client and the DD.
If there is a firewall between the media server and the DD, make sure NFS port 2049 is open.
Please make sure the same MTU size is used throughout the data path, and verify that the MTU is consistent between the DDR and the client. An inconsistent MTU causes fragmentation, which can lead to slow backups. The tracepath tool can be used to check the MTU (the hostname is a placeholder):
#tracepath -n DDR_HOSTNAME

Client OS Tuning

Linux OS

The Linux tuning guide describes how to optimize the backup and restore performance to DD.   This tuning guide contains the tcp and memory tuning parameters and recommended mount options.
Please use the following tuning guide for more details:   Linux tuning guide
Solaris OS

The Solaris tuning guide describes how to optimize the backup and restore performance to DD.   This tuning guide contains the tcp and memory tuning parameters and recommended mount options.
Please use the following tuning guide for more details: Solaris Tuning guide
AIX and TSM tuning Guide

The AIX and TSM tuning guide describes how to improve backup performance within AIX/TSM environments through OS level tuning of the AIX server and TSM backup application.
Please use the following tuning guide for more details: AIX and TSM tuning guide
Note: DD recommends that you properly test all planned configurations in a test environment before applying them to a production environment. You should also back up all your pre-tuning configurations.

September 11, 2014

Configure DD VTL Tapes in NetBackup

Steps to configure DataDomain VTL tapes in NetBackup:

  •  Check HBA details: 
    •  fcinfo hba-port
  •  List current configuration using: 
    •  cfgadm -al
  •  Configure the appropriate controller:
    •  cfgadm -c configure c10 (or, to force it, cfgadm -f -c configure c10)
  • Scan Solaris for new devices:  
    • devfsadm -v
  •  Check devices in /dev/rmt location:
    •  ls -lrth /dev/rmt/*
  • Check for scsa drivers: 
    • modinfo | grep -i scsa
  •   Unload sg driver:
    •   modunload -i 213 (unload only the sg driver; use the module ID shown by modinfo)
  • Backup current sg.conf file:  
    • mv /kernel/drv/sg.conf /kernel/drv/sg.conf.orig
  • Build new sg drivers:  
    • cd /usr/openv/volmgr/bin/driver/
    •   ../sg.build all -mt 30 -ml 40 ('mt' and 'ml' values can be selected based on environment)
    •   ./sg.install
  •   Scan for new drives and robots:
    •  tpautoconf -r (detect robot)
    •   tpautoconf -t (detect tapes)


Note: This document assumes the VTL configuration has already been done on the Data Domain before running the above commands. The steps also assume the OS is Solaris; Linux and other operating systems may require different commands.

September 10, 2014

Enable AIR on DataDomain

The steps are (assume DD7200 is the source DataDomain and DD7200A is the destination DataDomain):

1) Create a storage unit on the source DataDomain
ddboost storage unit create airA
2) Create a storage unit on the destination DataDomain
ddboost storage unit create airB
3) Link them
On DD7200
ddboost association create airA replicate-to DD7200A airB
On DD7200A

ddboost association create airB replicate-from DD7200 airA
4) On source NetBackup master server, create a Diskpool that must target the volume created in Step 1
Keep in mind that if that volume does not appear as a "Replication Source" type, then Step 3 is incorrect
5) On source NetBackup master server, create an SLP (we named it AIRTEST) containing one backup operation against the storage unit backed by the replication-source disk pool (in our case airA) and a replication operation

6) On source NetBackup master server create a test backup policy, targeting in the Policy storage the SLP created in step 5
7) On destination NetBackup master server, create a Diskpool that must target the volume created in Step 2
That volume must be of type Replication target. Otherwise Step 3 was not performed or performed incorrectly
8) On destination NetBackup master server, create an SLP with the same name as in Step 5 (in this case AIRTEST), but it must be an Import operation against the DiskPool created in Step 7
When you start the backup in source NetBackup, a ddboost job will run against the source DataDomain. When the backup completes, a replication from the source DataDomain to the destination DataDomain will start.
Once the replication finishes, an event is generated (you should be able to see it with # ddboost event show). That notifies the destination NetBackup that an import must be started. You will see an import job in the destination NetBackup master server.
This is a very condensed, text-only description, but that is how it works.

Backups to OST using DD fails with Error 2106:Disk storage is down

Issue



After updating the DataDomain plug-in (version 2.5.0.3) and running the tpconfig command to change the password on a NetBackup 7.5.0.4 media server configured for OpenStorage [OST(DD)] storage, all subsequent backups written to the DataDomain storage have failed with the error Disk storage is down (2106) reported.

Error



Error nbjm (pid=4340) NetBackup status 2106, EMM status Storage server is Down or unavailable.
Disk storage is down (2106)

Environment



NetBackup 7.5.x OpenStorage(DataDomain) Environment:
  • Master server: NBU_Master - Windows 2008 R2 (x64) Enterprise server SP1 - running NetBackup 7.5.0.4 
  • Media server: OST_DD_Media - Windows 2008 R2 (x64) Enterprise server SP1 - running NetBackup 7.5.0.4 
  • Disk storage: DataDomain_SU01 | Disk Type: OST(DD) | Media Server: OST_DD_Media | Disk Pool: OST_DD_Pool01 
  • Disk_Pool: OST_DD_Pool01  - Storage Server: OST_DD_Storage01 
Note: All server names listed above are generic names used only for the examples in this document

Cause



Troubleshooting:
Checked the OST(DD) configuration on the upgraded media server:
Opened a command prompt to ..\NetBackup\bin\admincmd (Note: On UNIX/Linux servers, the path to these commands is /usr/openv/netbackup/bin/admincmd), and ran the following commands: 
nbdevquery -liststs -stype DataDomain -U
- Checked the State : DOWN 
nbdevquery -listdp -stype DataDomain -U
- Disk_Pool: OST_DD_Pool01 = UP - Storage Server: OST_DD_Storage01 = DOWN 
nbdevquery -listdv -stype DataDomain -U
- OST_DD_Storage01 = DOWN | Disk Volume name: ddboost - Flag: InternalDown 
bpstsinfo -pi
- Properly displayed the DataDomain Plug-in information 
Syntax to change the state of an OST(DD) disk volume:
nbdevconfig -changestate -stype <server_type> -dp <disk_pool_name> -dv <disk_volume_name> -state <state>
  nbdevconfig -changestate -stype DataDomain -dp OST_DD_Pool01 -dv ddboost -state UP
- Command completed successfully
nbdevquery -liststs -stype DataDomain -U
- Checked the State: DOWN 
nbdevquery -listdp -stype DataDomain -U
- Disk_Pool: OST_DD_Pool01 = UP - Storage Server: OST_DD_Storage01 = DOWN 
nbdevquery -listdv -stype DataDomain -U
- OST_DD_Pool01 = DOWN | Disk Volume name: ddboost - Flag: InternalDown 
No change in state.
> bpstsinfo -si - Properly displayed the DataDomain storage server information
Opened the NetBackup Administration Console > Credentials > Storage Servers.
- Highlighted Storage Server OST_DD_Storage01 in the top pane
- In the bottom pane BOTH the master and media servers are listed as media servers for the Storage Server
Cause: Both the media server and the master server were specified as "media servers" under Credentials > Storage Servers. The OST(DD) plug-in and the NetBackup credentials for the storage server had been updated only on the media server, not on the master server. The mismatch in the plug-in version and the storage server credentials between the two "media servers" configured for the storage server caused the OST(DD) disk pool and storage server to go to a DOWN state.

Solution



Copied the OST(DD) 2.5.0.3 plug-in installation files to the NetBackup master server, and performed the following actions: 
1. Upgrade the OST(DD) plugin on the master to version 2.5.0.3 - to match the version on the media server
Cycle the NetBackup services. 
To stop and start the NetBackup services on a Windows master or media server:
Open a command prompt to \NetBackup\bin and run:
bpdown -v -f
Then, start the services:
bpup -v -f
Note: On a UNIX/Linux server these commands would be executed:
# /usr/openv/netbackup/bin/goodies/netbackup stop
# /usr/openv/netbackup/bin/goodies/netbackup start
 
2. Run the tpconfig -update command to change the OST(DD) credentials for the master server:
Open a command prompt to the \volmgr\bin directory (UNIX/Linux: /usr/openv/volmgr/bin directory) and run:
tpconfig -update -storage_server <storage_server_name> -stype DataDomain -sts_user_id <user_id> -password <password>
3. Run the following NetBackup command to confirm the password change on the master server allows the master server to access the OST(DD) storage server: 
Open a command prompt to \netbackup\bin\admincmd (UNIX/Linux server: /usr/openv/netbackup/bin/admincmd):
bpstsinfo -si
Note: The bpstsinfo -si command (run from the master server) should display a similar output to what was displayed when this command was run on the media server. 
Wait a few minutes (10 to 20 minutes) before proceeding.
4.  Run the following command from the master server to check the current state of the OST(DD) storage servers, disk volumes, and disk pools.
Note: below are the states that should be seen after the above actions have been performed:
nbdevquery -liststs -stype DataDomain -U 
- State: UP
nbdevquery -listdp -stype DataDomain -U 
- Disk_Pool: OST_DD_Pool01 = UP - Storage Server: OST_DD_Storage01 =UP
nbdevquery -listdv -stype DataDomain -U 
- OST_DD_Pool01 = UP | Disk Volume name: ddboost - Flag: InternalUP