Let us help you slay your dragons.
Disaster Recovery: What is it?
The following is a series of excerpts from projects that our team worked on. These are provided to demonstrate the complexity of the projects, along with the level and quality of effort supplied. Company names have been changed or removed to prevent any proprietary or security-sensitive information from being shared. Page numbering may be incorrect, and sentence structure may be awkward where a company name has been replaced with a generic name such as CompanyA and ConsultingCompany; please disregard these issues. This first image is from an approved IT design schematic for one of these projects.
Security Audit Report
TABLE OF CONTENTS
Security / Convenience Tradeoff
Scanner results and Hosts reports:
Process & Procedures Final Report – Phase I
DR Execution Steps
As an overview, the following is a graphical representation of the steps necessary to execute a Disaster Recovery Plan. The process is presented in greater detail later in this report.
The DR Process and Procedures
The nature of disaster is relative and contextual. What constitutes a disaster for company A may not be a disaster for company B. Each disaster will be different, and it is difficult to plan and create DR procedures for each and every disaster scenario. An event can be termed a disaster only if it interrupts the means by which critical corporate functions are accomplished. Because mission-critical functions in the modern business world are accomplished using automated information systems, networks, and telecommunications equipment, a disastrous event is typically characterized by an interruption of the means by which business data is produced, processed, and/or distributed.
Disasters are also defined in relation to time: it matters both when an interruption occurs and how long it lasts. Interruption of telephone service to a telemarketing company during a peak sales period is probably more disastrous than the same interruption during non-business hours. A data processing interruption that occurs during month-end processing creates a greater problem than one that occurs during the middle of the month, when processing demands are lower. In both cases the timing of the interruption plays a role in defining it either as a disaster or merely as an inconvenience.
Similarly, the duration of an interruption determines whether it is characterized as a disaster. A power outage of several hours may strain the data processing capability of a company without a backup generator, but with hard work and some expense, the company may be able to absorb this event without much harm.
However, a power outage of many days, owing to a cut cable or a cataclysmic storm, may not be so readily tolerated and may have disastrous consequences. What constitutes an “unacceptable period of time” for the interruption of information services is relative. It is based on a simple cost-tolerance effect: over time, the cost of an outage to a company increases and the company’s tolerance of the interruption decreases. This effect has been demonstrated statistically by numerous academic and industry-sponsored studies.
The natural question, based on the previous definition, is: “How do I determine what constitutes a disaster for Holland & Hart?” There are many factors to consider and many ways to interpret and present the data, but this calculation is very specific to each company. A basic ROI (return on investment) study needs to be completed before tolerance to a particular disaster can be determined. An ROI model and the results of such an analysis are presented in the Final Report. Since one of the goals for Holland & Hart is to become a paperless company, the need for guaranteed systems is apparent, as is the increased need to be able to access critical data. These needs and how they can be met are outlined in the following graphic.
The previous graphic illustrates the two objectives: Recovery Point Objective and Recovery Time Objective. The terms are defined as follows:
· Recovery Point Objective (RPO): Point in time to which application data must be recovered to resume business transactions.
· Recovery Time Objective (RTO): Maximum elapsed time allowed to complete a recovery of application data.
Disaster recovery planning for Holland & Hart must be conducted with these two objectives in mind. Certainly, without disaster recovery planning, an interruption in mission-critical business functions stands a much greater chance of translating into a disaster for the business. Conversely, there is substantial evidence that companies which plan for the possibility of a disaster stand a far better chance of recovering from such a catastrophic event than those that do not. This is the ultimate rationale for disaster recovery planning.
In case study after case study, companies that have survived disasters attribute their success to the implementation of pre-planned strategies by personnel who have been trained and have rehearsed their recovery roles. Even in these cases, however, there are common references to on-the-spot innovation, luck, and God to help explain recovery. Those who have experienced disasters are almost invariably humbled by the event, rather than viewing their recovery as a testimony to their planning skills and foresight.
The disaster recovery plan itself provides certain logistical supports for recovery that would be difficult to assemble on the fly in the wake of disaster. These supports may include prearrangements for an alternative data processing site, for on-demand rerouting of data or voice communications traffic, for power, for alternative work areas, and for recovery of system databases from off-site backups.
Moreover, planning, training and testing force those who will play a role in implementation to confront the dark side (the possibility of disaster) and to be better prepared to respond rationally to the chaos that often accompanies crisis. Beyond these substantial contributions, effective disaster recovery planning, which includes the identification and implementation of disaster avoidance capabilities and the articulation of disaster awareness training, may help to prevent avoidable disasters. Thus, the effectiveness of disaster recovery planning may also be measured in non-events: disaster potentials that have been minimized or eliminated.
This rationale for disaster recovery planning falls short of a blanket guarantee that a company will always recover from any disaster. Still, it provides some measure of recovery assurance that is missing in the absence of such planning. In short, disaster recovery planning constitutes the only strategy at our disposal for coping with the unpredictable occurrence and consequences of disaster.
The Disaster Recovery Circle has been presented to show the three stages of a Disaster Recovery Project: Documentation; Implementation & Risk Assessment; and Testing and Rehearsal. The diagram is cyclical because it includes a feedback loop for improvements to the Disaster Recovery Process and for the changes that must inevitably occur. Level I, Discovery, documents the entire company environment that is to be protected, including LAN, WAN, telecommunications, and facilities. Level II is the development of a business continuity strategy, followed by its implementation. Level III is the testing and rehearsal of the proposed disaster recovery plan. From this, a company is able to determine its level of preparedness. If this level of preparedness does not meet the projected goals, either the goals need to be re-evaluated or the implementation plans must be modified.
This requires that the changes be documented, a new implementation plan be developed, and the execution of the new plan be tested and evaluated, with re-evaluation repeated until the desired goals are achieved.
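The comparison of measured preparedness against projected goals can be illustrated with a short sketch. It assumes the goals are expressed as the RPO and RTO defined earlier, and it interprets the RPO as the maximum tolerable age of recovered data. The class names, field names, and figures are hypothetical and are not taken from the report.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class Objectives:
    rpo: timedelta  # assumed meaning: maximum tolerable age of recovered data
    rto: timedelta  # maximum tolerable elapsed time to complete recovery


@dataclass
class RehearsalResult:
    data_age_at_recovery: timedelta  # how stale the restored data was
    time_to_recover: timedelta       # elapsed time until recovery completed


def evaluate_rehearsal(goal: Objectives, result: RehearsalResult) -> List[str]:
    """Return the objectives a rehearsal failed to meet (empty list = goals met)."""
    gaps = []
    if result.data_age_at_recovery > goal.rpo:
        gaps.append("RPO")
    if result.time_to_recover > goal.rto:
        gaps.append("RTO")
    return gaps


# Hypothetical figures for illustration only.
goal = Objectives(rpo=timedelta(hours=4), rto=timedelta(hours=1))
rehearsal = RehearsalResult(
    data_age_at_recovery=timedelta(hours=24),
    time_to_recover=timedelta(hours=3),
)
unmet = evaluate_rehearsal(goal, rehearsal)
if unmet:
    print("Re-evaluate goals or modify the implementation plan for:", unmet)
else:
    print("Rehearsal met the projected goals.")
```

A non-empty result corresponds to the feedback loop in the cycle: either the goals are re-evaluated or the implementation plan is changed, and the rehearsal is run again.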
This line drawing represents the process flow of steps necessary to own and operate a disaster recovery plan. It identifies the major processes in developing and maintaining a DR Plan. As the drawing shows, the plans are integrated and rely on each other. Some event must have happened before the disaster plan can be executed; therefore, begin with Step 0. Upon occurrence of an event, first assess the extent of the damage, estimate the outage duration, and assess the condition of the facility or service. Based on this assessment, a Disaster Recovery Plan is activated with the declaration of a disaster.
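Step 0 can be sketched as a small decision function. This is only an illustration: the report does not state a declaration rule, so the rule used here (declare when the facility is unusable or the estimated outage exceeds a tolerable limit) and all names and values are assumptions.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Assessment:
    damage_extent: str           # free-text summary of the damage
    estimated_outage: timedelta  # best estimate of the outage duration
    facility_usable: bool        # condition of the facility or service


def should_declare_disaster(a: Assessment, tolerable_outage: timedelta) -> bool:
    """Assumed rule: declare when the facility is unusable or the
    estimated outage exceeds what the business can tolerate."""
    return (not a.facility_usable) or a.estimated_outage > tolerable_outage


# Hypothetical event for illustration only.
event = Assessment(
    damage_extent="water damage in server room",
    estimated_outage=timedelta(hours=6),
    facility_usable=True,
)
if should_declare_disaster(event, tolerable_outage=timedelta(hours=1)):
    print("Declare disaster: activate and notify the recovery staff.")
else:
    print("Handle as an incident; no declaration needed.")
```

The tolerable outage is passed in as a parameter because, as discussed earlier, what counts as a disaster depends on the company and on timing.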
Upon Declaration of a Disaster the following steps must be performed to activate and notify the recovery staff.
Recommendations Final Report – Phase I
Table of Contents
Proposed Business Continuation and Disaster Recovery Plan
DR and Business Continuation Option #1
DR and Business Continuation Option #1 Recommendations
DR and Business Continuation Option #2
DR and Business Continuation Option #2 Recommendations
Network Architecture Environment
System Management Resource Folders (SMRF)
Denver Data Center Power Infrastructure:
Denver Telephone Closet Power Infrastructure:
Business Final Report Phase I
Executive Summary
CompanyA has requested that ConsultantCompany propose a solution to develop a Disaster Recovery (DR) Plan. This DR plan will comprehensively assess the 11 remote and primary office locations, which contain approximately 48 servers and over 250 PCs, and provide recommendations to improve their current environment. There are three levels to a DR plan:
CompanyA’s Information Systems (IS) staff should be complimented for having produced an above-average network. Considering the radical changes that have occurred in the past year, the new network is extremely functional, stable and secure. The interconnections between remote offices are fast and reliable. The majority of the computing equipment is relatively new. But, as in all situations of this kind, several issues were overlooked in the haste of implementing this network. The firewall is seriously out of date. The power infrastructure that supports the Denver data center needs to be reworked. The data protection strategy for backups is inadequate. Internal efforts are being made to address two of these three issues. Further recommendations are included to address these issues and to radically increase the availability of all key IS services through redundancy. A new strategy is unveiled in this report and its companion reports that will provide more uptime and increase the availability of all key IS services to local and remote office personnel.
Scope
This effort covered Level I in its entirety for all networking devices within CompanyA’s direct control, documentation of their web site hosting service’s DR plans, recommendations, and only that portion of Level II which deals with development of an implementation plan and of an outline for DR procedures based on common business continuation strategies. Implementation of policies and procedures will be quoted and accomplished after CompanyA has determined their acceptable risk level. This hybrid of Level I and Level II DR processes was termed Phase I of a multi-phase DR project. While performing an exhaustive evaluation and analysis of the CompanyA IS environment, the following was accomplished, with recommendations provided where necessary:
CompanyA Business Continuity Objectives
CompanyA has two primary objectives for their IS environment: move to a paperless office environment, and ensure that data and critical server services are never down for more than one hour. To achieve these goals CompanyA has purchased significant hardware and software to receive documents and convert them to electronic format. This includes fax machines, scanners and software to maintain proper filing and record keeping of all documents brought into the company. ConsultantCompany is providing the solution for the company’s critical uptime requirements. To this end, the documentation provided from this effort sets forth a plan and recommendations.
Background
Terms & Concepts Used
Table 1
Hierarchical storage management (HSM) adds to archiving and file protection for disaster recovery the ability to manage storage devices efficiently, especially in large-scale user environments where storage costs can mount rapidly. It also automates backup, archiving, and migration across the hierarchy of storage devices in a way that frees users from having to be aware of the storage policies. Older files can automatically be moved to less expensive storage; when needed, they appear to be immediately accessible and can be restored transparently from the backup storage medium. The files that appear to be available are known as stubs, and they point to the real location of the file in backup storage. The process of moving files from one storage medium to another is known as migration. An administrator can set high and low thresholds for hard disk capacity that the HSM software uses to decide when to migrate older or less frequently used files to another medium. Certain file types, such as executable files (programs), can be excluded from migration.
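The threshold behavior described above can be sketched as follows. This is a simplified illustration, not the algorithm of any particular HSM product: it assumes migration starts once usage exceeds the high threshold, proceeds from the least recently accessed file, and stops once usage falls to the low threshold. The names, the excluded suffixes, and the choice to ignore the space taken by stubs are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FileEntry:
    path: str
    size: int           # bytes
    last_access: float  # timestamp; smaller means older


def select_for_migration(
    files: List[FileEntry],
    capacity: int,
    high: float,
    low: float,
    excluded: Tuple[str, ...] = (".exe", ".dll"),  # assumed exclusion list
) -> List[FileEntry]:
    """Pick files to migrate once usage exceeds the high threshold,
    oldest access first, until usage drops to the low threshold."""
    used = sum(f.size for f in files)
    if used <= high * capacity:
        return []  # below the high threshold: nothing to do
    target = low * capacity
    chosen = []
    for f in sorted(files, key=lambda f: f.last_access):
        if used <= target:
            break
        if f.path.lower().endswith(excluded):
            continue  # excluded file types stay on primary storage
        chosen.append(f)
        used -= f.size  # stub size is ignored in this sketch
    return chosen


# Hypothetical volume: 100 units of capacity, 90 in use.
files = [
    FileEntry("report.doc", 30, last_access=1.0),
    FileEntry("tool.exe", 30, last_access=2.0),
    FileEntry("memo.doc", 30, last_access=3.0),
]
to_migrate = select_for_migration(files, capacity=100, high=0.8, low=0.5)
print([f.path for f in to_migrate])
```

Using two thresholds rather than one prevents the software from migrating a single file every time usage crosses a limit; each migration pass frees a meaningful amount of space.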
Failure Modes
The following is a list of primary causes of data loss for any given LAN/WAN environment. Information for the main bullets below comes from a white paper written by Fred Moore, President of Horizon Information Strategies, titled Storage Infinite Disruptions, with information from StrategicResearch Corp. Sub-bullet information is provided as explanation, with some input from Veritas.
· Hardware 44%
  o Data path errors (DMP, switches, routers)
  o Disk crashes
  o Power supplies, interface boards, RAM
  o Miscellaneous
· Human Error 32%
  o “Fat fingering”
  o User, administrator and operator errors
  o Lost files from inadvertent deletions and updates
  o Misconfigured workstations, servers, DR storage devices …
  o Exploratory and untested configurations to accommodate new software or functions
· Software 14%
  o Inaccurate program updates
  o Incorrect run descriptions and scheduling
  o Spyware
  o New installation
· Virus 7%
  o Attacks on routers, servers, workstations and telephone systems
  o Denial of Service (DoS) attacks
  o Worms: destructive programs that destroy existing data
· Natural Disaster 3%
  o Water damage
  o Fire damage
  o Smoke damage
  o Earthquake
  o Tornado
As a part of natural disasters, the number of levels in a building needs to be considered. Disasters above or below the CompanyA Denver offices, or disasters adjacent to remote offices, can cause sufficient damage either to prevent immediate access to CompanyA work areas or to destroy them altogether.