Our interconnected
multi-building architecture with fully independent infrastructure
(separate power grids, networks, etc.) provides more flexible
and affordable high-availability options. For customers with
mission-critical transactions or data, a high-availability
solution can be designed to meet your technical, business,
and financial objectives.
MindCentric can help you determine the
correct high-availability solution for your business. If you
need high availability, business continuity, disaster avoidance,
disaster recovery, data security, dynamic failover, database
failover, application continuity, or remote standby site or
database standby, we can help by providing the technical pros
and cons and the budget information you need to determine the
best solution for your business.
Below are four scenarios to
help you understand some of the basic options:
- Scenario 1, High availability within a single site.
- Scenario 2, Multi-site high availability for read-only applications.
- Scenario 3, Multi-site failover to a passive site for rapid recovery
in case of site failure.
- Scenario 4, Multi-site high availability for transactional applications
with either application-level transaction replication or
database-level data replication.
Scenario 1. High availability within a single site.
Scenario 1. High availability in a single site, transactional applications.
Failure of the whole site is the least likely scenario, so
a major focus of any effort that involves a transactional
application should be making a single site highly available.
This is important, because the most difficult aspect of running
redundantly in two locations involves maintaining two synchronized
copies of the database. Key features include web and application
server farms, a database failover cluster, and RAID 5 or RAID 1
redundant storage.
Achieving high availability within a single site
adds a number of incremental costs:
• Load balancers, primary and backup. If the baseline
application is such that multiple web servers are needed for
scalability reasons, then load balancers will already be in
place.
• An additional web server beyond what is needed to
meet capacity.
• An additional application server beyond what is needed
to meet capacity requirements.
• An additional DBMS server with the same capacity as
the primary server to handle the entire workload in case of
failure.
• RAID 5 or RAID 1 array, depending on performance requirements.
RAID 5 will require about 12% more DASD than a non-RAID environment
(the exact figure depends on the number of disks in the array),
while RAID 1 requires 100% more DASD. The RAID subsystem itself
is expensive, but becomes a smaller percentage of the total
cost as the amount of data increases.
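The RAID storage figures in the last item can be checked with a little arithmetic. The sketch below is illustrative only: the function name and the array widths are assumptions, and it models RAID 5 as one disk's worth of parity spread across the array, which gives 12.5% extra storage for a nine-disk array.

```python
def raid_overhead(level: int, disks: int) -> float:
    """Extra raw capacity needed, as a fraction of usable capacity."""
    if level == 1:
        # Mirroring: every usable disk needs a full copy.
        return 1.0
    if level == 5:
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        # One disk's worth of parity is spread over the array,
        # so (disks - 1) disks hold usable data.
        return 1 / (disks - 1)
    raise ValueError(f"unsupported RAID level: {level}")


# Hypothetical array widths, chosen only to show the trend.
for disks in (4, 6, 9):
    print(f"RAID 5, {disks} disks: {raid_overhead(5, disks):.1%} extra")
print(f"RAID 1: {raid_overhead(1, 2):.0%} extra")
```

Wider RAID 5 arrays lower the overhead, which is why the percentage quoted for RAID 5 only holds for a particular array size.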
Outstanding risks:
• Load balancers and failover equipment do not catch
all possible software failures.
• Application design must account for session failover
in case an application server node fails.
• Database clustering failover will result in an outage
of service varying from 3 to 20 minutes.
• RAID will not prevent corruption of the database.
Normal database backups are still necessary.
• No high availability in case of site failure (including
degradation of performance) or an environmental problem affecting
users’ access to the site.
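To put the 3 to 20 minute database failover window in perspective, the following sketch converts it into yearly availability. The number of failovers per year is a made-up assumption, not a figure from any measurement, and the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(failovers_per_year: int, outage_minutes: float) -> float:
    """Fraction of the year the service is up, counting only failover outages."""
    downtime = failovers_per_year * outage_minutes
    return 1 - downtime / MINUTES_PER_YEAR


# Assumed: four cluster failovers per year (hypothetical).
for minutes in (3, 20):
    print(f"{minutes:>2} min per failover: {availability(4, minutes):.4%}")
```

The calculation ignores every other source of downtime, so it is an upper bound on what a single-site design could deliver under these assumptions.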

Scenario 2, Multi-site high availability for read-only applications.
Scenario 2. High availability, multi-site redundancy, in active/active
mode, read-only applications. This scenario includes a copy
of the entire environment at a geographically remote site.
Since the data is read only, both sites can simultaneously
process user queries and a geographic load balancer is used
to distribute requests to the least-loaded site. In the event
of a major site failure, all traffic is automatically routed
to the available site. Batch updates to refresh the read-only
content are applied simultaneously to the two sites. This
scenario is often employed for ensuring high-availability
of static websites, e.g., the existing IXC public websites.
The incremental cost for adding multi-site
high availability to Scenario 1, for read-only applications:
• Cost of building out a second site, facilities and
network.
• Geographic load balancers, primary and backup.
• Double the number of web servers required for a single
site. If degraded performance can be tolerated in the event
of a site failure, then fewer than twice the number of servers
is needed. For example, for a site that requires 10 servers
in a single location, the multi-site scenario may require
only 14 servers, 7 in each location. It may be acceptable
to the business owners that in the event of a site failure,
only 70% of capacity will be available. A major benefit is
that in this scenario, the extra capacity is always online
and able to handle unplanned spikes in demand during normal
operations.
• Double the number of application servers required
for a single site. The implications of employing fewer are
the same as for web servers.
• Duplicate DBMS cluster. As with web and application
servers, each of the database machines does not need to have
the ability to support the full workload if the business can
afford to operate in degraded mode for the duration of a site
outage.
• Duplicate data storage facility, plus a development
effort to handle updating the two databases and making sure
they are exact copies of each other. Depending on the size
of data and the amount of batch updates, this can be a small
or very large effort.
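The web server sizing example above (10 servers required, 7 deployed at each of two sites) can be expressed as a small calculation. This is a minimal sketch; the function names are illustrative, and it assumes all servers are identical and load is spread evenly.

```python
def degraded_capacity(required: int, servers_per_site: int) -> float:
    """Fraction of required capacity left when one of two sites is lost."""
    return min(1.0, servers_per_site / required)


def normal_headroom(required: int, servers_per_site: int, sites: int = 2) -> float:
    """Online capacity relative to the requirement during normal operation."""
    return servers_per_site * sites / required


required, per_site = 10, 7  # figures from the example in the text
print(f"after a site failure: {degraded_capacity(required, per_site):.0%}")
print(f"normal operation:     {normal_headroom(required, per_site):.0%}")
```

The same function can be used to explore other trade-offs, for instance how many servers per site are needed to keep a chosen minimum capacity after a site failure.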
Outstanding Risks:
• Session management across both sites is not possible,
so if session state is important, the application must employ
a sticky bit in the geographic load balancers to ensure a
user is routed back to the same site for the duration of an
entire session.
• The geographic load balancers must be able to monitor
events that define a site failure, e.g., unpingable local
load balancers. Any other event must be handled by the high
availability local infrastructure.
• Introducing two copies of the data increases the risk
that an operational error will result in their being out of
sync and providing users inconsistent results.
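The routing behavior described in the first two risks (least-loaded site selection, session stickiness, and rerouting when a site is marked failed) might look like the following in outline. All class and field names are hypothetical; a real geographic load balancer is a network appliance or DNS service, not application code.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    load: int = 0         # active sessions, used as a simple load measure
    healthy: bool = True  # set by whatever monitors the local load balancers


@dataclass
class GeoBalancer:
    sites: list
    sticky: dict = field(default_factory=dict)  # session id -> site name

    def route(self, session_id: str) -> str:
        by_name = {s.name: s for s in self.sites}
        pinned = self.sticky.get(session_id)
        # Sticky routing: keep the user on the same site while it is healthy.
        if pinned and by_name[pinned].healthy:
            return pinned
        candidates = [s for s in self.sites if s.healthy]
        if not candidates:
            raise RuntimeError("no healthy site available")
        # New or displaced session: pick the least-loaded healthy site.
        # A displaced session loses its state, since sites do not share it.
        target = min(candidates, key=lambda s: s.load)
        target.load += 1
        self.sticky[session_id] = target.name
        return target.name
```

For example, a session first routed to the less loaded site stays there even if that site later becomes busier, and moves only when the site is marked unhealthy.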

Scenario 3, Multi-site failover to a passive site
for rapid recovery in case of site failure.
Scenario 3. High availability, multi-site redundancy, active/passive
mode, transactional application. Because of the difficulty
of implementing Scenario 4, this scenario is introduced as
a simpler, less costly alternative for transactional applications.
This scenario is based on the assumption that a complete site
failure is the least likely unplanned outage, and in the unusual
circumstance of a disaster occurring, it is acceptable to
the business to tolerate a brief outage as the passive (inactive)
failover facility is brought online.
This approach builds on Scenario
2, but the second site is not active and storage-level replication
is used to keep the remote database up to date. If the DBMS
is Oracle, Microsoft SQL Server, or MySQL, facilities are available
to keep the remote database online in a standby mode, which
will reduce recovery time in the event of a failure. When
a site failure is detected, all systems can be automatically
started via remote clustering support.
There is just one incremental
cost for adding transactional capability to Scenario 2 (geographic
load balancing is not needed):
• Remote replication software and/or hardware that can
support it.
Outstanding Risks:
• Careful planning is necessary for a remote recovery.
The remote site must first be brought online and then the
DNS records have to be re-pointed to the secondary site.
• Detecting appropriate events that signal site failure
requires careful implementation and testing.
• Remote clustering tools are not mature.
• The cost of maintaining spare equipment for the redundant
site.
• The additional overhead of a wide area write can adversely
affect the performance of high-transaction applications.
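As a rough illustration of the first two risks, the sketch below declares a site failure only after several consecutive failed health probes, then runs the recovery steps in the required order. The threshold, the function names, and the callables are assumptions for illustration; real detection criteria need the careful testing noted above.

```python
from typing import Callable, Iterable


def site_failed(probe_results: Iterable[bool], threshold: int = 3) -> bool:
    """Declare failure only after `threshold` consecutive failed probes."""
    consecutive = 0
    for ok in probe_results:
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= threshold:
            return True
    return False


def fail_over(activate_standby: Callable[[], None],
              repoint_dns: Callable[[], None]) -> list:
    """Run the recovery steps in order: standby first, DNS second."""
    steps = []
    activate_standby()
    steps.append("standby online")
    repoint_dns()
    steps.append("dns repointed")
    return steps
```

Requiring consecutive failures avoids failing over on a single dropped probe, at the cost of a slightly longer detection time.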

Scenario 4, Multi-site high availability for transactional
applications with
either application-level transaction replication or database-level
data replication
Scenario 4. High availability, multi-site redundancy, active/active
mode, transactional application. This is the most difficult
scenario to implement, because two independent databases must
be kept in sync as users at either live site submit transactions
that write to their respective local databases. This solution
is not widely employed and involves a large development effort,
with an exhaustive test cycle to ensure that the resulting
very complex environment is correctly implemented.
Storage-level replication scenarios
are not appropriate in this situation, because both databases
must be online and processing transactions. Two programming
approaches are available: one involves transaction-level replication
and the other database replication. Only transaction-level
replication, where both databases are simultaneously updated
in a single, two-phase commit transaction, ensures complete
real-time synchronicity of the two environments. However,
this cure may be worse than the disease, because the most
likely failure would not be a site failure, but a failure
related to the complexity of the environment that must be
implemented to support the distributed update. Programming
to accommodate this failure scenario is complex: failed
transactions must be queued, for example in a redundant
queuing environment, and later restored.
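The two-phase commit approach, including the queuing of failed transactions, can be sketched with in-memory stand-ins for the two databases. Everything here (class names, the `fail_on_prepare` flag, the list used as a retry queue) is hypothetical and omits the durability, timeouts, and recovery logging a real implementation needs.

```python
class Participant:
    """In-memory stand-in for one site's database."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare  # simulates an unreachable site
        self.data = {}
        self.staged = None

    def prepare(self, key, value) -> bool:
        if self.fail_on_prepare:
            return False
        self.staged = (key, value)
        return True

    def commit(self):
        key, value = self.staged
        self.data[key] = value
        self.staged = None

    def rollback(self):
        self.staged = None


def two_phase_commit(participants, key, value, retry_queue) -> bool:
    # Phase 1: every site must vote yes before anything is written.
    if all(p.prepare(key, value) for p in participants):
        # Phase 2: all sites apply the update.
        for p in participants:
            p.commit()
        return True
    # Any "no" vote aborts the update everywhere and queues it for later.
    for p in participants:
        p.rollback()
    retry_queue.append((key, value))
    return False
```

Even in this toy form, one unreachable site blocks every write, which illustrates why the added machinery can become the most likely point of failure.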
Performance may also be an
issue, since each transaction would include an additional
remote write. An alternate approach would be to employ asynchronous
remote updates, either at the transaction level by writing
programs to queue the remote updates or by defining two-way
database replication schemes. The latter has been employed
by major banks; Oracle has reported achieving
a rate of 20 transactions per second in a financial
application. This approach introduces a time interval where
the two databases are not in sync (before the remote asynchronous
updates have been applied) and the application team must address
this.
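The out-of-sync interval introduced by asynchronous remote updates can be seen in a minimal sketch. The class below is a hypothetical in-memory model: each site applies a write locally, queues it, and ships it to the peer later. Conflict handling, where both sites update the same record before replication catches up, is deliberately left out.

```python
from collections import deque


class AsyncReplicatedSite:
    """In-memory model of one site with queued remote updates."""

    def __init__(self):
        self.data = {}
        self.outbox = deque()  # updates not yet applied at the peer

    def write(self, key, value):
        # The local write completes immediately; the remote write is deferred.
        self.data[key] = value
        self.outbox.append((key, value))

    def drain_to(self, peer) -> int:
        """Apply queued updates at the peer; return how many were applied."""
        applied = 0
        while self.outbox:
            key, value = self.outbox.popleft()
            peer.data[key] = value
            applied += 1
        return applied


site_a, site_b = AsyncReplicatedSite(), AsyncReplicatedSite()
site_a.write("order-1", "paid")
in_sync_before = site_a.data == site_b.data  # False until the queue drains
site_a.drain_to(site_b)
in_sync_after = site_a.data == site_b.data
```

Between `write` and `drain_to`, a user reading from the second site would not see the update, which is the window the application team has to design around.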
The incremental cost for adding
transactional capability to Scenario 3:
• Cost of developing and testing a complex program modification.
• Cost of developing database consistency checks.
• Cost for a queuing subsystem if the application development
approach is used. Database replication is included in the
cost of the database management system.
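A database consistency check of the kind mentioned in the second cost item could, in its simplest form, compare digests of the two copies and then list the differing keys. This is a sketch under strong assumptions: tables are modeled as Python dictionaries with string keys, and the function names are illustrative rather than part of any database product.

```python
import hashlib

_MISSING = object()  # marks a key that is absent on one side


def table_digest(rows: dict) -> str:
    """Order-independent digest of a table modeled as {key: value}."""
    digest = hashlib.sha256()
    for key in sorted(rows):
        digest.update(repr((key, rows[key])).encode("utf-8"))
    return digest.hexdigest()


def out_of_sync_keys(local: dict, remote: dict) -> list:
    """Keys whose values differ, or that exist on only one side."""
    keys = set(local) | set(remote)
    return sorted(
        k for k in keys
        if local.get(k, _MISSING) != remote.get(k, _MISSING)
    )
```

Comparing digests first keeps the common case cheap; the per-key comparison is only needed when the digests disagree.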
Outstanding Risks:
• The complexity of the solution introduces additional
points where the system can fail. The fact that this scenario
is not widely implemented indicates that this is not a trivial
development effort.
• With some approaches, there will be an interval, which
cannot be exactly determined, where the databases will be
out of sync.
• The additional overhead of a wide area write can adversely
affect the performance of high-transaction applications.