May 30, 2013

Troubleshooting for when Robotic Drives are going into AVR mode


Problem

Robotic drives go into AVR mode, and backups halt with a pending mount request.


Error



TLD(0) unavailable: initialization failed: Control daemon connect or protocol error

Solution



This problem is most often the result of a communication failure. NetBackup uses two daemons for robotic control: one runs on the machine with robotic control, and the other runs on the machine that has drives in the robot. For a TLD robot, for example, the two daemons are tldcd (runs on the server with robotic control) and tldd (runs on the server with drives in the robot). When this problem occurs, the drives change from TLD control to AVR control so that jobs go into a pending mount state rather than failing. If network communication between the two servers fails for a short time, the jobs do not need to fail; they can wait until the connection comes back up. At times, however, the cause is a more severe problem.

The media server's system log will contain an error such as this:
Dec 4 08:54:36 host01 tldd[260]: TLD(0) unavailable: initialization failed: Control daemon connect or protocol error 
Dec 4 08:56:41 host01 tldd[858]: TLD(0) [858] unable to connect to tldcd on host02: Error 0 (0)

The above error is what causes the drives in a robot to go into AVR control mode: the two daemons are unable to communicate.
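As a rough aid for spotting this condition, the sketch below scans a system log for the tldd message shown above and lists the robot control hosts that tldd could not reach. The function name failed_tldcd_hosts is hypothetical, and the log file location is an assumption that varies by platform (for example /var/log/messages on many Linux systems or /var/adm/messages on Solaris).

```shell
#!/bin/sh
# failed_tldcd_hosts LOGFILE
# Hypothetical helper: prints each host named in an
# "unable to connect to tldcd on <host>" message, one per line.
failed_tldcd_hosts() {
    grep 'unable to connect to tldcd on ' "$1" |
        sed 's/.*unable to connect to tldcd on \([^: ]*\).*/\1/' |
        sort -u
}

# Example (log path is an assumption, adjust for your platform):
#   failed_tldcd_hosts /var/log/messages
```

Each host printed is a machine whose tldcd daemon, or the network path to it, is worth checking against the causes listed below.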

It is not possible to give a single cause or a single solution.

Some common causes are:

1. Network connectivity has failed. In this case, the network must be restored.

2. There are multiple interfaces on one or both of the machines that cannot route to or resolve each other. In this case, either routing must be changed so that a request can reach its destination, or name resolution must be corrected. Adding the proper host names to the /etc/hosts file has been shown to work in some situations.

3. The tldcd daemon has entered an uninterruptible state or is hung, making it unable to reply to tldd. In this case, shut down the media management daemons by running /usr/openv/volmgr/bin/stopltid.

Next, run /usr/openv/volmgr/bin/vmps to get the PID (process ID) of the tldcd daemon and run the kill command on it. If that doesn't work, use kill -9. If this does not kill the process, the server will have to be rebooted. To restart the daemons, run /usr/openv/volmgr/bin/ltid.

Note: The daemon does not time out because it is hung on a system call. This is outside an application's ability to control.


4. The /etc/services file is missing the correct entries on one or both of the servers. Below are the entries that should be in /etc/services:
# Media Manager services #
vmd     13701/tcp       vmd
acsd    13702/tcp       acsd
tl8cd   13705/tcp       tl8cd
odld    13706/tcp       odld
tldcd   13711/tcp       tldcd
tl4d    13713/tcp       tl4d
tshd    13715/tcp       tshd
tlmd    13716/tcp       tlmd
tlhcd   13717/tcp       tlhcd
rsmd    13719/tcp       rsmd
# End Media Manager services #
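To check cause 4 quickly, something like the following sketch could compare a services file against the entries listed above. The function name check_media_manager_services is hypothetical; the names and ports are taken from the list in this note, and the match only looks at the service name and port/tcp columns.

```shell
#!/bin/sh
# check_media_manager_services [SERVICES_FILE]
# Hypothetical helper: prints each expected "name port/tcp" pair that is
# missing from the file (default /etc/services). Returns 1 if any are missing.
check_media_manager_services() {
    file=${1:-/etc/services}
    missing=0
    for entry in vmd:13701 acsd:13702 tl8cd:13705 odld:13706 tldcd:13711 \
                 tl4d:13713 tshd:13715 tlmd:13716 tlhcd:13717 rsmd:13719; do
        name=${entry%%:*}   # text before the colon
        port=${entry##*:}   # text after the colon
        if ! grep -Eq "^${name}[[:space:]]+${port}/tcp([[:space:]]|\$)" "$file"; then
            echo "missing: $name $port/tcp"
            missing=1
        fi
    done
    return "$missing"
}

# Example: run on both the robot control host and the media server.
#   check_media_manager_services /etc/services
```

Running it on both servers matters, since the note says the entries may be missing on either one.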

Note: This problem can occur not only between two different servers but also on a single server.
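The recovery sequence described under cause 3 can be sketched as a small shell helper. This is only an outline under stated assumptions: terminate_daemon is a hypothetical function name, the wait interval is arbitrary, and the PID is expected to be read manually from the vmps listing rather than parsed. The stopltid, vmps, and ltid paths are the ones given in this note.

```shell
#!/bin/sh
# Path to the media manager binaries as given in this note.
VOLMGR_BIN=${VOLMGR_BIN:-/usr/openv/volmgr/bin}

# terminate_daemon PID [WAIT_SECONDS]
# Hypothetical helper: sends the default kill signal, waits, then escalates
# to kill -9. Returns 0 if the process is gone, 1 if it survived kill -9
# (in which case the note says the server must be rebooted).
terminate_daemon() {
    pid=$1
    wait_secs=${2:-5}                        # arbitrary default wait
    kill "$pid" 2>/dev/null
    sleep "$wait_secs"
    kill -0 "$pid" 2>/dev/null || return 0   # process no longer exists
    kill -9 "$pid" 2>/dev/null
    sleep "$wait_secs"
    kill -0 "$pid" 2>/dev/null || return 0
    return 1
}

# Usage, as root on the server with robotic control:
#   "$VOLMGR_BIN/stopltid"     # shut down the media management daemons
#   "$VOLMGR_BIN/vmps"         # read the tldcd PID from this listing
#   terminate_daemon <tldcd_pid> || echo "tldcd survived kill -9; reboot required"
#   "$VOLMGR_BIN/ltid"         # restart the daemons
```

The helper only defines the function; the stop and restart commands are left as manual steps so that nothing is stopped or killed until the operator has confirmed the PID.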
