Problem
Troubleshooting robotic drives that go into AVR mode while backups halt with a pending mount request.
Error
TLD(0) unavailable: initialization failed: Control daemon connect or protocol error
Solution
This problem is most often the result of a communication failure. NetBackup uses two daemons for robotic control: one runs on the machine with robotic control, and the other runs on the machine that has drives in the robot. For a TLD robot, for example, the two daemons are tldcd (runs on the server with robotic control) and tldd (runs on the server with drives in the robot). When tldd cannot communicate with tldcd, the drives change from TLD control to AVR control, so that jobs go into a pending mount state rather than failing. If network communication between the two servers fails only briefly, the jobs do not need to fail; they can wait until the connection comes back up. At times, however, the failure is caused by a more severe problem.
The media server's system log will contain an error such as this:
Dec 4 08:54:36 host01 tldd[260]: TLD(0) unavailable: initialization failed: Control daemon connect or protocol error
Dec 4 08:56:41 host01 tldd[858]: TLD(0) [858] unable to connect to tldcd on host02: Error 0 (0)
The above error is what causes the drives in a robot to go into AVR control mode: the two daemons, tldd and tldcd, are unable to communicate.
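As a rough first check of the communication path described above, the sketch below pulls the robotic control host out of a tldd log line and tries a plain TCP connection to it. The function names and the regular expression are hypothetical (the pattern is modelled only on the sample log line shown here), and the port defaults to the 13711/tcp value listed for tldcd in /etc/services. A successful connect only shows that the port is reachable; it does not prove that tldcd is answering requests correctly.

```python
import re
import socket

# Pattern modelled on the sample tldd message above; other NetBackup
# versions may word the message differently (assumption).
UNABLE_RE = re.compile(r"unable to connect to tldcd on (\S+?):")


def control_host_from_log(line):
    """Return the robotic control host named in a tldd error line, or None."""
    match = UNABLE_RE.search(line)
    return match.group(1) if match else None


def can_reach_tldcd(host, port=13711, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run from the media server, can_reach_tldcd(control_host_from_log(line)) returning False would point toward one of the network, routing, or name-resolution causes listed below rather than a hung daemon.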
It is not possible to give a single cause or a single solution.
Some common causes are:
1. Network connectivity has failed outright. In this case, the network must be restored.
2. There are multiple interfaces on one or both of the machines that cannot route to or resolve each other. In this case, either routing needs to be changed so that a request can reach its destination, or name resolution needs to be corrected. Adding the proper host names to the /etc/hosts file has been shown to work in some situations.
3. The tldcd daemon has entered an uninterruptible state or is hung, making it unable to reply to tldd. In this case, shut down the media management daemons by running /usr/openv/volmgr/bin/stopltid.
Next, run /usr/openv/volmgr/bin/vmps to get the PID (process ID) of the tldcd daemon and run the kill command on it. If that doesn't work, use kill -9. If this does not kill the process, the server will have to be rebooted. To restart the daemons, run /usr/openv/volmgr/bin/ltid.
Note: The daemon does not time out because it is hung on a system call, which is outside an application's ability to control.
4. The /etc/services file is missing the correct entries on one or both of the servers. The following entries should be in /etc/services:
# Media Manager services #
vmd 13701/tcp vmd
acsd 13702/tcp acsd
tl8cd 13705/tcp tl8cd
odld 13706/tcp odld
tldcd 13711/tcp tldcd
tl4d 13713/tcp tl4d
tshd 13715/tcp tshd
tlmd 13716/tcp tlmd
tlhcd 13717/tcp tlhcd
rsmd 13719/tcp rsmd
# End Media Manager services #
Note: Not only can this happen between two different servers, it can also happen on the same server.
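The /etc/services check in cause 4 can be automated along these lines. This is a minimal sketch, not a NetBackup tool: the function name missing_services and the EXPECTED table are hypothetical, with the port numbers copied from the list above, and the parser assumes the usual "name port/protocol" layout of a services file.

```python
# Expected Media Manager entries, copied from the list in this note.
EXPECTED = {
    "vmd": 13701, "acsd": 13702, "tl8cd": 13705, "odld": 13706,
    "tldcd": 13711, "tl4d": 13713, "tshd": 13715, "tlmd": 13716,
    "tlhcd": 13717, "rsmd": 13719,
}


def missing_services(services_text, expected=EXPECTED):
    """Return sorted service names whose expected tcp port is absent or different."""
    found = {}
    for raw in services_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        port, proto = fields[1].split("/", 1)
        if proto == "tcp" and port.isdigit():
            found[fields[0]] = int(port)
    return sorted(name for name, port in expected.items()
                  if found.get(name) != port)
```

Passing the contents of /etc/services from each server to missing_services should return an empty list when every entry is present; any names it returns are the entries to add or correct on that server.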