Case Study: Troubleshooting Distributed File System Replication


Gary Olsen, Contributor
07.10.2007
Although Microsoft DFSR (Distributed File System Replication) has some great advantages, troubleshooting DFSR problems can be a bit of a mystery. Part of the problem is that administrators who are used to File Replication Service (FRS) often try to debug DFSR problems using FRS techniques.

The problem

Recently I was consulted about a situation in which a disk holding replicated DFSR data failed. After the disk was restored, it appeared that the staging directories were filling up with backlogged files and that DFSR was replicating backwards. Basically, it seemed to be replicating from the target back to the source and causing data loss.

The configuration was a hub site with a server, attached to a storage array. There were 20 or so remote sites, each with a server hosting network shares that users saved files to. DFSR was used to replicate each site's share data to a corresponding share on the hub server's storage array, which was then backed up.

The problem seemed to start when a disk drive on the storage array failed. At that same time, a user complained that the file he was accessing was "old" data. That is, the changes he made the previous day were not there, as if the old file had overwritten his newer version. The admins had developed a script they used to determine the health of the Distributed File System. This script showed a large number of files in the staging directories on several shares at the hub site.

Searching for a data loss fix

The administrators believed the staging directories should be empty. They also believed that the files were going from the target server's share, back to the staging directory and then back to the source, thus putting old data back and replacing newer data on the source shares. To prevent any data loss, the admins disabled replication for three replication groups.

In addition, they wanted to configure the replication groups for one-way replication, so it would only replicate from source (remote site) to hub. After all, why would you ever want to replicate from the hub to the remote site?

To summarize, it all broke down like this:

  • A disk, hosting DFSR replicated shares, failed.
  • Old data apparently was being replicated from target to source.
  • There were several DFSR shares that had upward of 250,000 "files" in the staging directories. This was interpreted as a problem, as it was in FRS.
  • The admins disabled replication on four replication groups to prevent old data from replicating over new data.
  • They wanted to solve this by disabling replication from the target to the source.
  • The admins had tried to force replication in the right direction by using the DFSRadmin command to specify the isPrimary flag (as I discussed in my previous article on configuring replication groups).

To troubleshoot this problem, let's look at each of these points individually and see where it takes us.

    First of all, the only way a file at the target could overwrite the same file at the source is if the target's version was modified and had a timestamp newer than the source's version. It cannot replicate old data. After checking further, administrators found that only one user had that problem, so they attributed it to a user error.
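    The "newest version wins" rule described above can be illustrated with a minimal Python sketch. This is a simplified model for illustration only, not DFSR's actual implementation: the FileVersion class, the winning_version function, and the server names BRANCH1 and HUB1 are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileVersion:
    """One member's copy of a replicated file (hypothetical model, not a DFSR structure)."""
    member: str
    modified: datetime


def winning_version(source: FileVersion, target: FileVersion) -> FileVersion:
    """Return the copy that survives: the one modified most recently.

    On a tie the source copy is kept; that tie-break is an arbitrary
    choice made for this sketch.
    """
    return target if target.modified > source.modified else source


# The branch copy was edited after the hub copy, so the hub cannot overwrite it.
branch = FileVersion("BRANCH1", datetime(2007, 7, 9, 16, 30))
hub = FileVersion("HUB1", datetime(2007, 7, 8, 9, 0))
print(winning_version(branch, hub).member)  # prints BRANCH1
```

    Under this model, the only way the hub's copy could replace the branch's copy is if someone modified the hub's copy later, which is the point made above.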

    As for the "files" in the staging directory, it held not only staged file data but also change information about those files (i.e., RDC signatures, RDC hashes and USN Journal data).

    There is not necessarily a 1:1 relationship between entries in the staging directory and the physical files. Data in the staging directory is used to determine if the file on the source or target is more recent, and to then replicate accordingly. If you have a dynamic environment with large amounts of data being replicated, it's not unusual to see a large amount of data in the staging directory. The admins, applying what they knew from FRS, wrongly assumed that something was wrong.

    Next, they wanted to modify the replication link to be unidirectional and replicate from source to target only. Microsoft strongly warns against doing this, as it would prevent proper evaluation of the files and would probably break replication.

    Note: The concept of "hub and spoke" is in the mind of the administrator. DFSR just replicates the newest data -- wherever it is -- to the other end of the replication link. It is multi-master replication and you should not attempt to change it.
    It's a waste of keystrokes to use the DFSRadmin "isPrimary" flag to kick-start replication. This flag is only for the benefit of initial population of the target share. After the share is populated upon creation of the replication group, you can set this flag all you want -- but it won't make any difference. Once initial replication has occurred, this flag is automatically cleared. Setting isPrimary manually will only help if initial replication is not working.

    The solution

    At this point we have debunked a number of faulty assumptions that the admins made in diagnosing this problem. The fact is, the only real problem was the failed disk. While it was offline, DFSR was working just the way it was supposed to -- saving all the changes in the staging directories. This was a good thing. If the admins had realized that, recognized that the "old data overwriting new" report came from only one user, and left everything alone, the problem would have healed itself.

    In the end, the solution was simply to get the disk back online and restore the backed up data, then enable the replication links and let it all converge. In fact, DFSR really is quite self-healing, and is built to handle a large amount of data.

    Here are some good tips for troubleshooting Distributed File System Replication:

    1. DFSR servers have a dedicated DFSR event log. Use this event log when looking for errors and warnings related to DFSR.
    2. To get a good DFSR health check, use the DFSRadmin utility's "health" parameter, which generates a health report for a replication group. For example:

      dfsradmin health new /rgname:dfs_data /refMemName:SRV1 /repname:c:\dfsreports\SRV1-DFShealth.html /fsCount:True

      Where:

      rgname -- the replication group name
      refMemName -- the name of the server (the reference member)
      repname -- the path and name of the report
      fsCount -- specifies whether to count the files in each folder

      You can get help for this command with: dfsradmin health new /?

      This puts an HTML (and optionally an XML) file in the DFSReports directory, so you can easily script a program to collect reports from all Distributed File System servers.

    3. Of course, the best troubleshooting tool is knowledge. By referring to Microsoft's DFS Web site, you'll find a plethora of help articles, including:
      * A collection of DFSR frequently asked questions
      * An excellent step-by-step guide for the DFS solution in Windows Server 2003 R2
      * An information-stocked document titled Designing Distributed File Systems
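
    As a rough illustration of scripting the health report from the tips above across several servers, here is a minimal Python sketch. It uses only the dfsradmin parameters shown in this article; the function names, the server list, and the choice of Python are assumptions for illustration, and collect_reports will only work on a Windows server where dfsradmin is installed.

```python
import subprocess
from pathlib import PureWindowsPath


def build_health_command(rg_name, member, report_dir):
    """Build the dfsradmin health command for one server, as an argument list.

    Only the parameters shown in the article are used; the report file
    name pattern (<server>-DFShealth.html) follows the article's example.
    """
    report = PureWindowsPath(report_dir) / f"{member}-DFShealth.html"
    return [
        "dfsradmin", "health", "new",
        f"/rgname:{rg_name}",
        f"/refMemName:{member}",
        f"/repname:{report}",
        "/fsCount:True",
    ]


def collect_reports(rg_name, members, report_dir, run=subprocess.run):
    """Run the health command once per server; return {server: exit code}.

    The run parameter is a hypothetical hook so the loop can be exercised
    without dfsradmin being present.
    """
    results = {}
    for member in members:
        completed = run(build_health_command(rg_name, member, report_dir))
        results[member] = completed.returncode
    return results


# Show the command line that would be run for one server.
print(" ".join(build_health_command("dfs_data", "SRV1", r"c:\dfsreports")))
```

    On a DFSR server, a call such as collect_reports("dfs_data", ["SRV1", "SRV2"], r"c:\dfsreports") would then generate one report per server; SRV2 is a made-up name here.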

    Do you have an Active Directory issue or problem that you'd like Gary to write an article about? Email him at glo11749@yahoo.com. Note: Gary cannot answer each query personally or guarantee that all will be answered. However those queries that have widespread interest or involve common AD issues will be addressed.

    Gary Olsen is a systems software engineer for Hewlett-Packard in Global Solutions Engineering. He authored Windows 2000: Active Directory Design and Deployment and co-authored Windows Server 2003 on HP ProLiant Servers. Gary is a Microsoft MVP for Directory Services and formerly for Windows File Systems.


    All Rights Reserved, Copyright 2004 - 2008, TechTarget | Read our Privacy Policy