Tuesday, January 4, 2011

Exchange 2007 Database Clustering and High Availability Features

Continuous Replication Overview
Continuous replication is a new Exchange 2007 feature where the storage group's database and log files are copied to a secondary location. The storage group being accessed by clients contains the active copy of the database, and the storage group in the secondary location contains the passive copy of the database.
As new transaction logs are closed, or filled up, they are copied to that secondary location, validated, and then replayed into the copy of the database. The net effect is to provide you with a backup of the database that has already been restored to a mountable location before a disaster happens. This backup will be up to date, with all (or nearly all) transaction log replay already done. If the primary database is destroyed or unavailable, you can be up and running on the secondary copy within minutes.
To support continuous replication, transaction log file size is now 1 MB in Exchange 2007. In previous versions of Exchange, transaction log files were 5 MB.
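
The copy, validate, and replay cycle described above can be sketched in a few lines of Python. This is only an illustrative model, not how Exchange implements it: the `LogFile` and `PassiveCopy` names are hypothetical, and a SHA-256 digest stands in for whatever integrity check the replication service really performs.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class LogFile:
    generation: int  # sequence number of the closed log
    payload: bytes   # stand-in for the transactions held in the log
    checksum: str    # hypothetical integrity tag written when the log is closed


def close_log(generation: int, payload: bytes) -> LogFile:
    """Model the active copy closing a log and stamping it with a checksum."""
    return LogFile(generation, payload, hashlib.sha256(payload).hexdigest())


@dataclass
class PassiveCopy:
    replayed: list = field(default_factory=list)  # generations already replayed

    def ship_and_replay(self, log: LogFile) -> bool:
        """Copy one closed log, validate it, then replay it in order."""
        # Validation: reject a log whose content does not match its checksum.
        if hashlib.sha256(log.payload).hexdigest() != log.checksum:
            return False
        # Replay must be sequential: a gap means an earlier log is missing.
        expected = self.replayed[-1] + 1 if self.replayed else log.generation
        if log.generation != expected:
            return False
        self.replayed.append(log.generation)
        return True
```

In this model a damaged or out-of-sequence log is refused, which is why the passive copy stays mountable: it only ever contains logs that were validated and applied in order.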

Continuous replication (also known as log shipping and replay) is new technology in Microsoft Exchange Server 2007 that creates and maintains database copies to provide full high availability and disaster recovery solutions for mailbox databases. With the release of Exchange Server 2007 Service Pack 1 (SP1), there are now three forms of continuous replication (LCR, CCR, and SCR). Together with the traditional shared-storage cluster (SCC), that gives four high availability options:

• Single Copy Clusters (SCC): SCC in Exchange 2007 is essentially the same as previous versions of Exchange clustering. This means that it still uses the shared storage model, where the actual mailbox and public folder databases only exist once within the storage infrastructure (hence the term single copy).
• Local Continuous Replication (LCR): LCR is a single-server solution that uses continuous replication to create and maintain a copy of a storage group on a second set of disks connected to the same server as the active storage group copy.
• Cluster Continuous Replication (CCR): CCR is a high availability feature that provides failover capabilities for e-mail service and mailbox data. It uses continuous replication to provide data redundancy (one additional copy of the data) and clustering technology to provide server redundancy.
• Standby Continuous Replication (SCR): SCR is a disaster recovery feature that employs continuous replication to create and maintain multiple copies of a storage group. It is provided as a feature to facilitate site resiliency.

1. Single Copy Clusters [SCC]
The first technology we'll look at is Single Copy Clusters (SCC). This technology will be familiar to those Exchange administrators who have implemented clustering in previous versions of Exchange since it's essentially the new name for 'traditional' Exchange clustering.
SCC in Exchange 2007 is essentially the same as previous versions of Exchange clustering. This means that it still uses the shared storage model, where the actual mailbox and public folder databases only exist once within the storage infrastructure (hence the term single copy). The individual cluster nodes of the SCC environment can all access the same shared data, but only one node at a time can actually use it.
In production, an SCC requires a minimum of two nodes, a private link between the nodes (the heartbeat), a public connection to your local LAN, and a shared storage array. The diagram below depicts a very basic SCC cluster configuration:

The traditional idea behind this model is that when the primary node fails for any reason, all of the services that the primary node was responsible for will be passed over to the passive node, and normal operation of the Exchange server will resume.

To all intents and purposes the model above looks exactly like the clustering format that was used by both Exchange 2003 and Exchange 2000 – however in Exchange 2007 Microsoft introduced the following improvements:
• In Exchange 2003, once you had configured your Windows cluster, you had to install and configure the clustered MSDTC and then install the Exchange 2003 binaries on the first node. You then had to use the Windows Cluster Administrator to manually create the Exchange Virtual Server (EVS) IP address and Network Name, allocate storage, and create the Exchange resources (MSExchangeSA). In Exchange 2007 SCC clusters you still need to create an MSDTC resource, but the rest of the process is fully automated.
• In Exchange 2003 the management of the Exchange Virtual Server (for example starting and stopping services) was accomplished via the Windows Cluster Administrator. In Exchange 2007 you can accomplish all of these tasks via the EMS (Exchange Management Shell), and in Exchange 2007 SP1 the Exchange Management Console (EMC) also provides this functionality: cluster and application administration all in one place!
• Again in Exchange 2003, when you had finally got your Exchange EVS up and running, you would still have a number of little things that you needed to tweak. In Exchange 2007 all of this has been done for you (an example would be memory configuration; remember those “interesting” boot.ini and registry tweaks!)

Some concept changes:
In Exchange 2003 the common term for a clustered Exchange Server was “Exchange Virtual Server”, or EVS. In Exchange 2007 the term is replaced with “Clustered Mailbox Server”, the reason being that Exchange 2007 clusters do not support roles such as CAS, HUB or Unified Messaging. They are purely mailbox servers, whereas in Exchange 2003 your Exchange EVS would also support direct MAPI, OWA, and SMTP.
Each node in the cluster can be in a position to take control of the “Clustered Mailbox Server”, but as in Exchange 2003 the nodes still retain their own network identity. In essence each node has its own NetBIOS name and IP address, but can also take over and host the Clustered Mailbox Server in the event of a fail-over (whether this is manual or the result of a hardware issue).
Another welcome change is that the concept of Active/Active clusters has been abandoned for all forms of clustering in Exchange 2007, so you can no longer have an Active/Active SCC cluster. There are many reasons for this, but essentially it boils down to scalability and performance: Exchange 2003 A/A clusters did not scale much beyond 1900 users and could end up performing poorly. As Exchange 2007 is 64-bit (for production), you can pile power into your primary and passive nodes, which makes the concept of “load balanced” fail-over in Active/Active redundant.

Pros and Cons of SCC:
As in all scenarios there are pros and cons to any configuration. The following are the arguments for and against SCC clustering in Exchange 2007.
Pros:
• It’s a familiar clustering model for those that have set up and configured Exchange 2000 and 2003 clusters
• Provided that the hardware is certified, it is a pretty simple type of clustering to set up and configure
• Provides a reasonable amount of fault tolerance from a node perspective
• Good option for larger companies that are limited on sites but have the money to invest in a locally fault tolerant solution
Cons:
• Is typically expensive to set up. This is mainly down to the fact that shared storage is required between the nodes; this is usually SAN based, but in a number of installations is SCSI. Generally speaking you will require a significant hardware overhead to accommodate SCC
• The shared storage is a single point of failure: lose the shared disk array and you lose the cluster, unless you are employing some form of replication software across sites (more expense, and if you are, you need to consider CCR)
• Due to the shared storage requirement of SCC, both your cluster nodes need to be in the same location
• Requires a very specific hardware configuration to run on
• Requires Enterprise Editions of Exchange and Windows

2. Local Continuous Replication [LCR]
This is the first of the new continuous replication technologies available within Exchange 2007. The first and most obvious point to make about Local Continuous Replication (LCR) is that it is a single-server solution and not a clustered solution. Therefore, LCR will not protect you against the failure of an entire server. Having said this, LCR does implement the new log shipping and replay functionality that Exchange 2007 provides. It does this by shipping the transaction logs generated by a storage group, known as the active copy, to a separate set of disks connected to the same server, referred to as the passive copy. Once the logs have been transferred to the alternate disks, they are replayed into a copy of the Exchange database that also resides on these disks. Thus, a separate copy of the database is maintained in near real time on the same server, and you therefore have data redundancy. Should there be a problem with the production database, the administrator can switch over to the passive copy of the database fairly quickly.

What do you need for LCR?
In order to make use of LCR your server should meet the following requirements:
• A Server capable of running x64 Exchange 2007
• The server should have two independent RAID controllers (you can configure LCR using a single controller, but if you lose that controller you will not be able to access the replayed data).
• Separate storage per RAID controller (for example, on the primary RAID controller you have a single Exchange database sitting on a RAID 5 array and all of your logs sitting on a mirror; these will, and should, represent separate disks). This configuration should be replicated on your passive RAID controller.
The following is a simplified diagram which depicts LCR operation – the orange areas of the diagram represent separate disks attached to separate controllers on a single server:

During normal operation when using LCR, the active database’s logs are shipped to and then replayed into the passive database. In the event of a fault on either the primary RAID controller or the primary disk array, you can manually “activate” the passive copy of the Exchange data. Activation can be accomplished via one of the following means:
• Changing the Active Storage group and database paths via the EMS (Restore-StorageGroupCopy) or EMC (Restore-StorageGroup task)
• Via the Operating System (reconfiguring Disk mount points / drive paths)

Pros and Cons of LCR:
Pros:
• Great solution for smaller firms that have the money to invest in a single well-specced Exchange server
• Only requires the Standard Edition of Windows and Exchange
• For smaller enterprises it represents a good level of fault tolerance within a single box
• Easy to set up
Cons:
• Not really suitable for larger organizations where mail is critical
• Does require a server that can handle enough disks and two RAID controllers for it to really be effective (this could put it out of an SME’s price range)
• Can only contain one database per LCR enabled storage group

3. Cluster Continuous Replication [CCR]
CCR makes use of a type of Windows clustering called MNS (Majority Node Set), combined with the log shipping and replay technology that is new in Exchange 2007. There will be more on that later.
Whilst SCC offers you protection against server failure and LCR offers you protection against data failure, Cluster Continuous Replication (CCR) offers you both server and data protection. As you can guess from the name, CCR is the second of the new continuous replication technologies available with Exchange 2007. A CCR environment is a two-node cluster only, consisting of an active and a passive node. The key difference between a CCR environment and an SCC environment is that the CCR environment does not use shared storage. Rather, both nodes of the CCR environment have their own copies of the Exchange databases and transaction logs. The transaction logs from the active node are asynchronously copied to the passive node and replayed into the database. Should a problem occur with either the active node itself or the active node's databases, the Exchange server can be failed over to run from the previously passive node and its own copy of the databases.

How does it work?
Before we go into the detail of how it works, let’s have a quick look at the minimum requirements to implement CCR clustering.
Two cluster nodes which roughly meet the following criteria:
• Exist in the same routable subnet (unless you are running Exchange 2007 SP1 on Windows Server 2008)
• Have enough storage, whether based on DAS, iSCSI, or SAN. It is sensible to ensure that each node’s storage is a match from a capacity perspective. Remember that each node in this type of clustering uses its own storage to function, not the shared array principle that we have seen used in Exchange 2000 / 2003 and Exchange 2007 SCC
• A third server which can perform the role of the File Share Witness (or FSW). This is normally placed on an Exchange 2007 Hub Transport server, but any Windows server that can host a file share will work.
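
The reason for the third server is the majority rule that gives Majority Node Set its name: the cluster only stays up while more than half of the votes are reachable. A minimal sketch of that arithmetic for a two-node CCR cluster follows, assuming the file share witness counts as a third vote. The function names are illustrative only and are not part of any Windows or Exchange API.

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """A majority node set needs strictly more than half of all votes."""
    return votes_online > total_votes // 2


def ccr_cluster_up(node_a: bool, node_b: bool, witness: bool) -> bool:
    """Two nodes plus a file share witness: three votes, two needed."""
    total_votes = 3
    votes_online = sum([node_a, node_b, witness])
    # The witness is only a tiebreaker; at least one real node must be up
    # to host the clustered mailbox server.
    return (node_a or node_b) and has_quorum(votes_online, total_votes)
```

With only two votes, losing either node would also lose the majority. The witness lets a surviving node keep running when its partner fails, for example `ccr_cluster_up(True, False, True)` is true while `ccr_cluster_up(True, False, False)` is not.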
The idea behind CCR is that there are two copies of the Exchange database: one active (which resides on the storage of the primary node) and one passive (which resides on the storage of the passive node). Transaction logs from the active database are asynchronously “shipped” to the passive node and then replayed into its database to give you a fairly current copy of the data. More on this a bit later.
When you initially install the passive node in a CCR cluster, each storage group and its associated database are copied from the primary to the passive node (this is called seeding). From there on, log files are shipped to the passive node and replayed on a constant basis.
Logs are shipped from the primary node to the passive node only once they are “closed”. As a result, the passive node does not always have a copy of every single log from the primary node, so its database might not be totally up to date after an unplanned fail-over. This can be rectified once you have resolved the issues with the primary node and performed a fail-back.
The exception is when the Exchange administrator issues the Move-ClusteredMailboxServer command from the EMS (Exchange Management Shell). This would normally be done when maintenance is required on the primary node, and a log sync is performed between the nodes when this command is run, so the passive database is completely up to date.
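
The difference between an unplanned fail-over and a scheduled move can be put into numbers. The sketch below is a simplified model rather than Exchange behaviour: it assumes log generations are numbered sequentially, that the log currently being written on the active node has not been shipped, and that each log is 1 MB as stated earlier. The function name and the dictionary keys are hypothetical.

```python
LOG_SIZE_MB = 1  # Exchange 2007 transaction log size, as stated earlier


def missing_on_failover(active_generation: int,
                        passive_generation: int,
                        scheduled: bool) -> dict:
    """Estimate what the passive copy lacks at the moment of fail-over.

    active_generation:  generation of the log being written on the active node
    passive_generation: highest generation already copied to the passive node
    scheduled:          True for an administrator-initiated move, which is
                        modelled here as syncing all logs before switching
    """
    if passive_generation > active_generation:
        raise ValueError("passive copy cannot be ahead of the active copy")
    if scheduled:
        missing = 0  # log sync brings the passive node fully up to date
    else:
        missing = active_generation - passive_generation
    return {"missing_logs": missing, "max_missing_mb": missing * LOG_SIZE_MB}
```

For example, if the active node is writing generation 105 and the passive node has copied up to generation 102, an unplanned fail-over leaves three logs (at most 3 MB) unreplayed, whereas a scheduled move leaves none.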
A diagram is provided below which depicts a simplified version of how a CCR cluster can be configured over three sites (two nodes at two separate sites and a third site for the file share witness):

Pros and Cons of CCR:
Pros:
• In a multi-site scenario it represents an excellent fault tolerant, high availability solution with DR and Business Continuity
• Doesn’t require any special hardware configuration
• Not tied to close proximity based clustering
• Doesn’t require third party replication tools
• Ideal for larger Exchange organizations with multiple sites where you could locate an additional Exchange installation
Cons:
• Not as simple to configure as traditional clustering
• Works best with multiple sites (from a DR and BC perspective)
• Requires the Enterprise version of Exchange and Windows 2003
• Can only contain one CCR enabled database per storage group

4. Standby Continuous Replication [SCR] – Service Pack 1 for Exchange 2007:
Standby continuous replication (SCR) is the big new feature in Microsoft Exchange Server 2007 SP1. SCR uses continuous log replication, or log shipping. You configure servers in a remote location (typically another data center) as targets to accept replicated transaction logs from source servers and to use the data to update local copies of mailbox databases. If catastrophe strikes the source server, you can use the copies of the mailbox databases to restore messaging service. The value of SCR is that it adds site resilience to the list of options you can consider when you plan for business continuity.

Essentially SCR allows an Exchange database to be replicated to a target elsewhere (a different data centre or Exchange server) on a per storage group basis. One of the great things about SCR (and a key difference from LCR and CCR) is that you can replicate your data to multiple targets and multiple target types. For example, your source Exchange server can ship its data to an offline standby server in a geographically dispersed data centre, whilst also shipping data to a specific storage group on an active Exchange cluster within your main building. The following diagram depicts a basic SCR scenario:

As you can see, SCR has great potential as an additional line of defense against losing your data. However, there are some things worthy of note about this configuration:
• The database and log paths MUST be the same on the source and target servers
• A target standby server must not have LCR enabled for any storage group contained on it
• A target must have the Exchange 2007 mailbox role installed (even if it is not hosting any mailboxes)
• SCR can be administratively delayed
Again, as with LCR, the process of switching (or activating) between active and passive copies of your database is a manual operation.
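
Two of the points above, multiple targets and administratively delayed replay, can be illustrated together. The sketch below is a toy model and not the SCR implementation: `ScrTarget`, `replay_lag_s` and `ship` are hypothetical names, and time is passed in as plain seconds so the behaviour is easy to follow. Logs are copied to every target as soon as they are shipped, but each target only replays a log once it has aged past that target's own lag.

```python
from dataclasses import dataclass, field


@dataclass
class ScrTarget:
    name: str
    replay_lag_s: float                          # hypothetical per-target replay delay
    copied: dict = field(default_factory=dict)   # generation -> time the log arrived
    replayed: list = field(default_factory=list) # generations already replayed

    def receive(self, generation: int, now: float) -> None:
        """Copy happens immediately; only replay is delayed."""
        self.copied[generation] = now

    def replay_due(self, now: float) -> list:
        """Replay, in order, every copied log older than the configured lag."""
        for generation in sorted(self.copied):
            if generation in self.replayed:
                continue
            if now - self.copied[generation] < self.replay_lag_s:
                break  # later logs are newer still, so stop here
            self.replayed.append(generation)
        return self.replayed


def ship(generation: int, now: float, targets: list) -> None:
    """One source storage group shipping the same log to several targets."""
    for target in targets:
        target.receive(generation, now)
```

A delayed target holds logs it has not yet replayed, so a problem that reaches the source database (a corruption, say) is not applied to that copy straight away. The trade-off is that activating the delayed copy means replaying the backlog first.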

Pros and Cons of SCR:
Pros:
• Highly resilient and allows for multiple targets for your data
• Requires only the Standard Editions of Windows and Exchange
• Works for enterprises of all sizes
• Allows for a built-in delay in replication
Cons:
• Can only be managed from the command shell (this means that it could be tricky to set up and manage)
• One database per storage group