June 04, 2015

DD Replication Sizing Guide

PURPOSE

Several factors must be taken into consideration when setting up replication. This article explains the following considerations:

Disk Space Capacity
Network Bandwidth
CPU Performance
APPLIES TO

All Data Domain systems
Software releases 4.7 and above
Replication
SOLUTION

When configuring two machines for replication, it is essential to ensure that the machines to be used are adequately sized for the job. In this regard, check the size of the file systems on the source and target machines to make sure that the target is large enough to handle the data load placed upon it by the source machine.

In the case of directory replication, it is also necessary to ensure that the processing power of both machines is fairly equal, in order to reduce the possibility of significant lag between the source and destination systems.

Determine the system capacity. At the Data Domain system prompt type:
filesys show space

==========  SERVER USAGE   ==========  
Resource             Size GiB   Used GiB   Avail GiB   Use%   Cleanable GiB*  
------------------   --------   --------   ---------   ----   --------------  
/backup: pre-comp           -        0.0           -      -                -  
/backup: post-comp    32877.8        0.0     32877.8     0%              0.0  
/ddvar                  189.0        8.7       170.7     5%                -  
------------------   --------   --------   ---------   ----   --------------  
Figure 1

This command gives a tabular display of the system's space usage in GiB.
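
Where a scripted capacity check is needed, output such as the above can be parsed directly. The following Python sketch is illustrative only (it is not a Data Domain tool) and assumes the column layout shown in Figure 1:

# Minimal sketch: pull the post-comp size and available space (GiB) for
# /backup out of captured "filesys show space" output. The column layout
# is assumed to match Figure 1; adjust for other DD OS releases.
def backup_post_comp(output):
    for line in output.splitlines():
        if line.strip().startswith("/backup: post-comp"):
            fields = line.split()
            # ['/backup:', 'post-comp', size, used, avail, use%, cleanable]
            return float(fields[2]), float(fields[4])
    raise ValueError("post-comp line not found")

sample = """\
/backup: pre-comp           -        0.0           -      -                -
/backup: post-comp    32877.8        0.0     32877.8     0%              0.0
/ddvar                  189.0        8.7       170.7     5%                -
"""
print(backup_post_comp(sample))   # (32877.8, 32877.8)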

Replication System and Directory Size Restrictions

The Avail GiB column shows the disk space available for backup data. In this example, the total size of the system is 32877.8 GiB. If this output were from a destination Data Domain system configured for collection replication, the source Data Domain system could be any size up to 32877.8 GiB (it cannot exceed this value). If configured for directory replication, the source Data Domain system can be larger than the destination Data Domain system; however, the directory to be replicated on the source cannot be larger than the destination directory. To summarize, the destination location for replication must be equal to, or larger than, the source replication context.

The following subsections describe factors that affect the performance of collection and directory replication, and provide guidelines for sizing systems to meet performance goals.
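
The rule above reduces to a simple comparison. The sketch below is only an illustration with hypothetical figures, not an official sizing tool:

# Hypothetical sizing check: the destination must have at least as much
# post-comp capacity as the data it will receive from the source.
destination_capacity_gib = 32877.8        # from "filesys show space" on the destination

# Collection replication: the entire source file system is replicated,
# so the whole source system must fit on the destination.
source_system_size_gib = 24000.0          # hypothetical source system size
assert source_system_size_gib <= destination_capacity_gib

# Directory replication: only the replicated directory (context) matters;
# the source system itself may be larger than the destination.
source_context_size_gib = 8500.0          # hypothetical post-comp size of the directory
assert source_context_size_gib <= destination_capacity_gib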
TCP/IP Performance Considerations

Both collection and directory replication use TCP/IP for networking; therefore their performance is also limited by the performance of TCP/IP. In particular, TCP/IP handles dropped packets poorly and has difficulty with high-bandwidth, high-delay networks. Packet drop rates as low as 0.1% severely degrade network throughput, particularly on high-bandwidth, high-delay networks. If the network drops packets, the only workaround is to use a WAN acceleration appliance such as those provided by Cisco, Silverpeak, Juniper or Riverbed, among others.

Networks with bandwidth <= T2 and RTT (Round Trip Time) up to one second provide good throughput.
Networks of >= T3 will encounter significant throughput degradation starting at an RTT of 300-500ms.
More generally, throughput under packet loss is approximately:

Throughput = MSS / (RTT * sqrt(p))

where:

MSS = maximum segment size (typically 1460 bytes)
RTT = round trip time
p = probability of packet loss
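
As a quick sanity check, the formula can be evaluated directly. The sketch below uses the typical 1460-byte MSS with an illustrative RTT and loss rate:

import math

def tcp_throughput(mss_bytes, rtt_sec, loss_prob):
    """Approximate TCP throughput under packet loss: MSS / (RTT * sqrt(p))."""
    return mss_bytes / (rtt_sec * math.sqrt(loss_prob))

# Example: 1460-byte segments, 100 ms RTT, 0.1% packet loss.
bps = tcp_throughput(1460, 0.100, 0.001)
print(round(bps / 1e6, 2), "MB/s")   # ~0.46 MB/s, far below even a T3 link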
Collection Replication Considerations

Collection replication can usually saturate the network link up to about 70MB/s in network throughput, and is generally insensitive to network RTT and load on the source and destination.

Collection replication replicates two types of containers:

Data containers generated by user writes
Recipe containers generated by cleaning

Cleaning copies live data from existing containers to new, more compact containers. Recipe containers do not contain any data, just lists of segment fingerprints, which are processed and reconstituted at the destination. Recipe processing is highly I/O intensive. When processing recipe containers, the network bandwidth used by collection replication drops significantly.

Directory Replication Considerations

Directory replication throughput can be limited by both the available network bandwidth and by the filtering/packing process. The filtering/packing overheads are proportional to the amount of logical data to be replicated. Directory replication, therefore, has two throughput limits to keep in mind. The first is the network or post-compressed (post-comp) throughput, and the second is the logical or pre-compressed (pre-comp) throughput. It is important to consider both limits when sizing systems.

In addition to network and filtering/packing limits, directory replication throughput is higher when using multiple contexts and can vary significantly depending on the level of compression, data locality, and load on the source and destination systems. The following shows the ideal single and multi-context pre-comp throughput by model:

Ideal single-context pre-comp throughput:
DD690+3xES20 => 137 MB/s
DD460 => 80 MB/s

Ideal multi-context pre-comp throughput:
DD690+3xES20 => 430 MB/s
DD580 => 260 MB/s
DD460 => 200 MB/s
Note: Sustained throughput for typical user environments is 25-50% lower.

Due to characteristics of the directory replication protocol, the pre-comp throughput is also reduced by high RTT (Round Trip Time), particularly for high-bandwidth networks.

For >= T1 networks, there will be significant throughput degradation starting at RTT >= 300ms.
For >= T3 networks, there is significant throughput degradation starting at RTT >= 50ms.
Because of the packing/compressing overhead on the source restorer, using a more CPU-intensive local compression algorithm such as gz or gzfast instead of the default lz algorithm can dramatically reduce replication throughput.

A good way to verify TCP/IP throughput is to do the following:

At the Data Domain system prompt (4.7 and above) on the destination system, type:
net iperf server port 2051

At the Data Domain system prompt (4.7 and above) on the source system, type:
net iperf client <destination-host> port 2051

where <destination-host> is the hostname or IP address of the destination system.

Using /dev/urandom as the data source compensates for WAN accelerators such as Riverbed, but limits test performance to about 50 Mbit/s (roughly T3).

Replication uses TCP port 2051, which must be open through firewalls.
Verify that the available network bandwidth is sufficient to replicate the expected rate of post-comp changes.
Verify that the pre-comp throughput is sufficient to replicate the expected rate of logical (pre-comp) changes; to be safe, plan on 25-50% of the ideal throughput (see the example below).
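
The last two checks can be done as a back-of-the-envelope calculation. All figures below (change rates, window, link speed) are hypothetical and only show the method; the 430 MB/s ideal figure is the multi-context value for the DD690+3xES20 listed above:

# Hypothetical example: verify both post-comp (network) and pre-comp
# (logical) replication rates fit in the available replication window.
window_hours = 10

# Post-comp check: changed data sent over the wire vs. link capacity.
daily_post_comp_change_gib = 500.0
link_mbit_per_sec = 155.0                              # hypothetical WAN link
link_gib_per_window = link_mbit_per_sec / 8 / 1024 * 3600 * window_hours
assert daily_post_comp_change_gib <= link_gib_per_window   # ~681 GiB available

# Pre-comp check: logical data to filter/pack vs. a derated ideal throughput.
daily_pre_comp_change_gib = 7000.0
ideal_pre_comp_mb_per_sec = 430.0                      # DD690+3xES20, multi-context
sustained_mb_per_sec = 0.5 * ideal_pre_comp_mb_per_sec # plan on 25-50% of ideal
sustained_gib_per_window = sustained_mb_per_sec / 1024 * 3600 * window_hours
assert daily_pre_comp_change_gib <= sustained_gib_per_window   # ~7560 GiB possible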
