Reliable Exchange 2010 DAG: Part 1, Active Directory Health

One of the most awaited (and misunderstood) features added to Exchange Server 2010 is the disaster-recovery/high-availability feature known as the Database Availability Group (DAG). Microsoft touts it as a perfected “2.0” of the heretofore painful database replication process known as Cluster Continuous Replication (CCR), which came into our lives in Exchange 2007. While the DAG does create a much simpler implementation path for administrators, there are some important design considerations that have the potential to create deep, painful “catches and gotchas.”

What follows is part one of a multi-part series on creating Database Availability Groups. Here’s a brief outline of what each part will cover, and as they’re released they will turn into links to those articles.

  • Part One: Active Directory Health
  • Part Two: DAG Planning and Architecture
  • Part Three: Installing, Configuring, and Securing Your Environment
  • Part Four: Backups and Disaster Planning

Active Directory Health is Key!

An absolutely flawless Active Directory environment is a totally non-negotiable prerequisite for a successful Exchange Server 2010 implementation. Any small issue with replication, permissions, or global-catalog functionality could mushroom into a catastrophic roadblock to your Exchange 2010 project. Because of this immutable Law of IT Physics, it is worth the effort to review your Active Directory structure thoroughly before you start.

You can follow any number of other articles for benchmarking Active Directory. Here’s a pretty good guide to benchmarking your environment. This is something you should have done on day one, but you certainly want to do it before you begin altering the AD environment for Exchange Server.

You should also keep a document which describes the replication relationships between your domain controllers and various Active Directory sites. Understanding what your AD is expected to do (and why) before you have a problem will save you time later when you need to solve a problem or implement a feature.

This is only a basic list of things to review. Google can provide copious tips and how-tos for the down and dirty of each step, and if the hue and cry is great enough I’ll put together my own series of posts on how to accomplish these things. But generally, I’m looking at the following items before moving on to altering the AD schema and installing Exchange.

  1. DNS Replication
  2. SRV Records and Corresponding Network Ports on Each DC
  3. Logs and the Network Administrators Who Love Them
  4. AD Topology

1. DNS Replication

On each DNS server, check for the existence of corresponding (matching, correctly replicated) A, PTR, NS, MX, SOA, and SRV records. Make sure to drill down to each individual AD site in the tree. Replication is tricky, and just one missing record is a problem. You access this through Server Manager in Windows Server 2008, or Computer Management in Windows Server 2003. Use the DNS tree, shown below for your amusement and graphical edification.

If DNS Replication doesn’t work, nothing else in AD is likely to work either.
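If you would rather not eyeball every zone by hand, the comparison above can be roughed out in a few lines of Python. This is only a sketch: it assumes you have already pulled each server's records into a set of (name, type, value) tuples by whatever export method you prefer, and the server names and addresses below are made up for illustration.

```python
from typing import Dict, Set, Tuple

# A record is modelled as (name, type, value),
# e.g. ("dc1.example.local", "A", "10.0.0.10"). This shape is an assumption.
Record = Tuple[str, str, str]


def find_missing_records(records_by_server: Dict[str, Set[Record]]) -> Dict[str, Set[Record]]:
    """Return, per DNS server, the records that another server has but it lacks."""
    # The union of every server's records is what a fully replicated zone should hold.
    all_records: Set[Record] = set().union(*records_by_server.values())
    missing = {}
    for server, records in records_by_server.items():
        gap = all_records - records
        if gap:
            missing[server] = gap
    return missing


if __name__ == "__main__":
    # Hypothetical data standing in for two servers' zone exports.
    zone = {
        "dns1": {("dc1.example.local", "A", "10.0.0.10"),
                 ("dc2.example.local", "A", "10.0.1.10")},
        "dns2": {("dc1.example.local", "A", "10.0.0.10")},
    }
    for server, gap in find_missing_records(zone).items():
        for name, rtype, value in sorted(gap):
            print(f"{server} is missing {rtype} record {name} -> {value}")
```

A server that shows up in the result is missing something a peer already has, which is exactly the "just one record" case worth chasing down.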

2. SRV Records, AD Replication, and Corresponding Network Ports

Verify the existence of SRV records for Global Catalog, Kerberos, and LDAP for each existing domain controller in DNS. If you’re missing records you can either add them manually and reboot all the DCs over a few hours, or demote the offending domain controller (the one whose records are missing from DNS) and then re-promote it. If you find wholesale missing records (whole servers missing any SRV records, for example), you may have to consider more drastic action: manually adding the DNS records rarely restores replication when it is broken this badly. A whole server missing SRV records in DNS often means a DCPROMO failed somewhere along the line and wasn’t cleaned up properly, or somebody tried to clean up a failure and deleted the wrong records. Either way, this is where things get dicey.
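As a memory aid for the SRV check, here is a small Python sketch that builds the well-known record names to look for. The domain and forest names are placeholders, and the list covers only the Global Catalog, Kerberos, and LDAP records mentioned above, not every SRV record a domain controller registers.

```python
from typing import List


def expected_srv_names(domain: str, forest_root: str) -> List[str]:
    """Build the well-known SRV names for LDAP, Kerberos, and Global Catalog."""
    return [
        f"_ldap._tcp.{domain}",                 # LDAP for the domain
        f"_ldap._tcp.dc._msdcs.{domain}",       # LDAP, domain controllers only
        f"_kerberos._tcp.{domain}",             # Kerberos KDC over TCP
        f"_kerberos._udp.{domain}",             # Kerberos KDC over UDP
        f"_gc._tcp.{forest_root}",              # Global Catalog
        f"_ldap._tcp.gc._msdcs.{forest_root}",  # Global Catalog via _msdcs
    ]


if __name__ == "__main__":
    # "corp.example.local" and "example.local" are hypothetical names.
    for name in expected_srv_names("corp.example.local", "example.local"):
        print(name)
```

Each name can then be looked up against every DNS server, for example with nslookup and its -type=SRV option, and the answers compared to your list of domain controllers.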

How long the client’s Active Directory has operated in this manner will determine how much pain is involved in the fix. If it’s recent and there have been few Active Directory changes (possibly even none), consider an authoritative restore to a date before replication broke on all of your DNS servers. You can use LDIFDE to export the directory from each domain controller and confirm which objects (if any) are missing from a given domain controller. If more than a few objects, or any critical objects, are missing, you may need to take additional steps before performing a restore, such as exporting mailboxes to PST on the existing Exchange server and backing up data for third-party applications that are tied to the affected users in Active Directory.

If you don’t have a backup from before the date of the error, this is where it gets really dicey, because you’ll have to follow one of several unsavory paths to AD health. Once you reach this point you can either forcibly demote the offending domain controller, sacrificing whatever un-replicated objects it held (admins will have to recreate them manually later), or you can call Microsoft to see if they have a solution.

Once replication has been restored, users or groups can be re-added to the domain, and the project can proceed.
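The LDIFDE comparison mentioned above boils down to diffing the distinguished names in two exports. Here is a minimal Python sketch of that idea. It assumes plain-text LDIF with unfolded, non-base64 "dn:" lines, which real exports do not always guarantee, and the sample entries are invented.

```python
from typing import Set


def read_dns(ldif_text: str) -> Set[str]:
    """Collect distinguished names from LDIF text (plain 'dn:' lines only)."""
    dns = set()
    for line in ldif_text.splitlines():
        # Assumption: DNs are not base64-encoded ('dn::') or folded across lines.
        if line.lower().startswith("dn: "):
            # DNs are case-insensitive, so normalise before comparing.
            dns.add(line[4:].strip().lower())
    return dns


def missing_objects(reference_ldif: str, suspect_ldif: str) -> Set[str]:
    """Return DNs present in the reference export but absent from the suspect one."""
    return read_dns(reference_ldif) - read_dns(suspect_ldif)


if __name__ == "__main__":
    # Hypothetical exports from a healthy DC and a suspect DC.
    healthy = ("dn: CN=Alice,DC=example,DC=local\nchangetype: add\n\n"
               "dn: CN=Bob,DC=example,DC=local\nchangetype: add\n")
    suspect = "dn: CN=Alice,DC=example,DC=local\nchangetype: add\n"
    for dn in sorted(missing_objects(healthy, suspect)):
        print("missing on suspect DC:", dn)
```

Run it in both directions: objects can be missing from either side when replication has been broken for a while.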

Also, if DNS entries exist and you still don’t have total replication (i.e., errors continue to pop up when you force replication), you should then look at the DCs themselves. Use nmap or some other simple port scanner to verify that the ports referenced in the SRV records are actually open on the servers specified in DNS. Verify that the servers throwing errors are actually listening on the network. Look at switch logs for indications of physical-layer issues, like interface flapping, that might indicate a failing NIC or Ethernet cable.
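If nmap isn’t handy, a plain TCP connect test does the same basic job. The Python sketch below is an illustration only: the port list is a small assumed subset (DNS, Kerberos, LDAP, Global Catalog), it checks TCP and not UDP, and the host name in the usage comment is hypothetical.

```python
import socket
from typing import Dict, Iterable

# A few common AD-related TCP ports; adjust to match what your SRV records advertise.
AD_TCP_PORTS = {53: "DNS", 88: "Kerberos", 389: "LDAP", 3268: "Global Catalog"}


def check_tcp_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, bool]:
    """Attempt a TCP connection to each port and report whether it succeeded."""
    results = {}
    for port in ports:
        try:
            # A completed connection means something is listening on that port.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and name-resolution failures.
            results[port] = False
    return results


# Example usage (hypothetical host name):
#   for port, is_open in check_tcp_ports("dc1.example.local", AD_TCP_PORTS).items():
#       print(port, AD_TCP_PORTS[port], "open" if is_open else "unreachable")
```

A port that shows as unreachable from one site but open from another points at a firewall or routing problem in between rather than at the DC itself.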

Troubleshoot your replication connectivity problems until you pass out or it gets fixed. If you don’t, your project is doomed, I promise you. Microsoft has skilled agents that are worth the $280.

3. It’s Log! It’s Log!

Check the Application, System, Directory Service, and DNS Server logs on all domain controllers for irregularities. Any recent or obviously recurring warnings or errors should be examined and resolved before proceeding. Be sure to leave your logs intact after any changes you make so you can compare the log output from before and after. Also, after implementing any changes, monitor the logs for a few hours or days to ensure you’ve actually solved the problem.
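Spotting the "obviously recurring" entries is easier with a quick tally. The Python sketch below assumes you have exported the events into (log, source, event ID, level) tuples by some means of your own; the field layout and the sample rows are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# An event is modelled as (log_name, source, event_id, level). This shape is an assumption.
Event = Tuple[str, str, int, str]
Key = Tuple[str, str, int]


def recurring_problems(events: Iterable[Event], threshold: int = 3) -> List[Tuple[Key, int]]:
    """Count warnings and errors per (log, source, event ID) and keep the repeat offenders."""
    counts = Counter(
        (log, source, event_id)
        for log, source, event_id, level in events
        if level.lower() in ("warning", "error")
    )
    # most_common() sorts by count, highest first.
    return [(key, n) for key, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    # Made-up sample rows standing in for an exported event log.
    sample = [("System", "NETLOGON", 5719, "Error")] * 4 + [
        ("DNS Server", "DNS", 4013, "Warning"),
        ("Application", "SomeApp", 1, "Information"),
    ]
    for (log, source, event_id), n in recurring_problems(sample):
        print(f"{log} / {source} / event {event_id}: seen {n} times")
```

Running the same tally before and after a fix gives you a simple before-and-after comparison.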

4. AD Topology

If your Active Directory spans multiple physical locations and includes a WAN, make sure your AD topology settings (specifically the replication “costs”) associated with each site-link accurately reflect the bandwidth available on those links. This will become critical later if you intend to have Exchange Servers at more than one of those sites, since the Hub Transport role uses site-link costs to calculate paths to other servers in the environment.
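To see why the costs matter, it helps to work out the least-cost path yourself. The Python sketch below runs Dijkstra’s algorithm over a handful of site links. It illustrates least-cost routing in general and is not Exchange’s actual routing code; the site names and costs are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

# Site links as (site_a, site_b, cost); links are treated as bidirectional.
SiteLink = Tuple[str, str, int]


def least_cost_path(links: List[SiteLink], source: str, target: str) -> Tuple[int, List[str]]:
    """Dijkstra's algorithm over site links; returns (total cost, list of sites)."""
    graph: Dict[str, List[Tuple[str, int]]] = {}
    for a, b, cost in links:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    queue = [(0, source, [source])]  # (cost so far, current site, path taken)
    visited = set()
    while queue:
        cost, site, path = heapq.heappop(queue)
        if site == target:
            return cost, path
        if site in visited:
            continue
        visited.add(site)
        for neighbour, link_cost in graph.get(site, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbour, path + [neighbour]))
    raise ValueError(f"no site-link path from {source} to {target}")


if __name__ == "__main__":
    # Hypothetical topology: the direct HQ-Branch link is slow, so it carries a high cost.
    links = [("HQ", "Branch", 100), ("HQ", "DR", 50), ("DR", "Branch", 30)]
    cost, path = least_cost_path(links, "HQ", "Branch")
    print(f"cost {cost} via {' -> '.join(path)}")
```

If someone had left the slow direct link at a cost lower than 80, the same calculation would pick it instead, which is how a stale cost value ends up steering traffic over the wrong WAN link.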

Another consideration for multi-site environments is the number and positioning of Global Catalog Servers (GCS). Exchange Server relies heavily on domain controllers configured as GCS for many functions, from address-book generation to mail routing, and it is critical that each site have a local GCS, a reliable WAN link to the rest of the forest, or both, to ensure uptime and stability.

Final Thoughts

Your Active Directory environment can derail your Exchange Server implementation very easily, wasting valuable time and pushing delivery dates. Taking the time to identify and iron out problems beforehand will go a long way towards a smooth implementation and a stable Active Directory environment for years to come.

Join us again next time for Part 2: DAG Planning and Architecture.
