
Friday, February 3, 2017

The High Availability Conversation Part 1: Introduction

Hello Dear Reader!  Twitter is a fantastic tool if you use SQL Server.  It is a great way to network, find presentations, interact with experts, and #SQLHelp offers some of the best technical forum conversation on the subject.  I was on Twitter and my friend Kenneth Fisher asked a question.

Kenneth’s question was interesting.  Let’s define a couple of terms up front.  HA in this case is High Availability.  It is the capability for a database or data store to maintain availability and connectivity to a Graphical User Interface, Web Service, or some other data-consuming application despite a localized outage or failure of the primary system.

High Availability is different from Disaster Recovery, or DR.  Disaster Recovery is required when a disaster, natural or man-made, prevents a data center and the hardware within it from supporting regular business processes.  In this blog, we will be addressing HA only, as that is what Kenneth asked about.

When asked for further explanation, Kenneth said he didn’t need HA in his environment (more on this in a moment); he was looking for a well-rounded point of view.  I sent a couple of replies and then started direct messaging (DM’ing) Kenneth to share my insight.  This is a conversation you must have with the business.

Making the right decisions requires IT and business leaders to work together, with a keen understanding of how business logic translates into technology workflows.  Over this series I’m going to give you a couple of examples of when HA was needed, and some where it was not.  Later we will discuss different HA technologies and when each one is right for you.  First, I will define some key things to cover in a High Availability Conversation between the business and IT.


There are three types of conversations: technical, business, and budget.  Business comes first.  Our first goal is to understand what the business is facing.  Some of this you may know in advance.  If so, review the information with the business to make sure IT and the business share the same understanding.
·       Describe the business process –
o   What product or service are we providing?
o   What type of employees participate in the process?
o   What does each employee need to get from the process?
·       Diagram a workflow of how a customer interaction occurs
·       Add the steps that are IT-specific
·       Identify critical days, weeks, months, holidays, or times of the day for the specific business process
·       Understand any current pain points

First, make sure you understand the business process surrounding any technical system.  Draw shapes on a dry-erase board, connect them to the business process, and make sure you know who the business owner is for each technical system.  Force self-examination and business awareness.  Add your IT system to the diagram.  Highlight business and IT dependencies.  Talk to teams to understand what happens when a system goes down.

You can use Visio, PowerPoint, a dry-erase board, or whatever you like.  Make a visual representation based on the findings from the meeting.  Confirm that everyone agrees with it, or use it to ask their teams for additional vetting.  You may not get it right on the first pass.  That is okay.  Depending on the complexity of your IT environment, this could become an interview process.  There are some standard technical terms that you should use:

·       SLA – Service Level Agreement.  This is the expected agreement between the business and IT for overall system availability.  This may be limited to business hours.  In some cases, Service Providers may have contractual SLAs.  If your business has an SLA with customers, there could be a financial penalty for not meeting it.
·       RPO – Recovery Point Objective.  This defines the maximum amount of data, usually expressed as a window of time, that can be lost on a system without causing impact or harm to the business.  Put another way, it is the point in time to which all data or transactions must be recoverable.  The price of any solution increases as the RPO gets lower.
·       RTO – Recovery Time Objective.  This is the maximum acceptable amount of time to get a system back online in the event of an outage.  The time to recover will vary greatly depending on the size, volume, and type of data involved.  If you need to reduce recovery time, there are strategies and technical solutions that can help.  The complexity of these solutions can add to the overall cost.
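To make RPO and RTO a little more concrete, here is a minimal Python sketch of how you might sanity-check a backup-only approach against the targets the business gives you. It rests on loud simplifying assumptions: worst-case data loss equals one full interval between backups (for SQL Server, typically the transaction log backup interval), and restore time is estimated as database size divided by an assumed restore throughput. The function names and all numbers are hypothetical illustrations, not measurements or a SQL Server API.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTargets:
    rpo_minutes: float  # maximum tolerable data loss, expressed as a window of time
    rto_minutes: float  # maximum tolerable time to bring the system back online


def worst_case_data_loss_minutes(backup_interval_minutes: float) -> float:
    """Assumption: if the failure hits just before the next backup runs,
    everything since the last backup is lost, so the worst case is one interval."""
    return backup_interval_minutes


def estimated_restore_minutes(database_gb: float, restore_gb_per_minute: float) -> float:
    """Rough estimate only: size divided by an assumed restore throughput.
    Real restores also replay logs, run recovery, and reconnect applications."""
    return database_gb / restore_gb_per_minute


def meets_targets(targets: RecoveryTargets, backup_interval_minutes: float,
                  database_gb: float, restore_gb_per_minute: float) -> dict:
    loss = worst_case_data_loss_minutes(backup_interval_minutes)
    restore = estimated_restore_minutes(database_gb, restore_gb_per_minute)
    return {
        "worst_case_loss_min": loss,
        "meets_rpo": loss <= targets.rpo_minutes,
        "estimated_restore_min": restore,
        "meets_rto": restore <= targets.rto_minutes,
    }


if __name__ == "__main__":
    # Hypothetical business targets: lose at most 15 minutes of data,
    # be back online within 60 minutes.
    targets = RecoveryTargets(rpo_minutes=15, rto_minutes=60)
    # Hypothetical environment: 15-minute log backups, a 500 GB database,
    # restoring at an assumed 5 GB per minute.
    print(meets_targets(targets, backup_interval_minutes=15,
                        database_gb=500, restore_gb_per_minute=5))
```

With these made-up numbers, 15-minute backups satisfy the RPO, but the estimated 100-minute restore misses the 60-minute RTO. That is the kind of gap that starts the conversation about HA rather than backups alone.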


After the first business meeting, the technical team should meet to discuss their options.  I like to develop three options when working with IT Managers.  This is where we, the technical staff, align on a few things:

·       Proposed Technical Solution
·       Cost of Technical Solution
·       Cost to Maintain Technical Solution
·       Migration from the existing platform to the new one

There are many ways to address HA.  Some solutions use third-party vendor products that you may already utilize, others are hardware or virtualization options, and others are database-specific.  The IT architecture, the size of the corporation, and the business need for HA will shape these decisions.

Notice I said three options for IT Managers.  When you meet with the business, you should have a cohesive vision.  When I started my IT career, finding a manager with IT experience was extremely rare.  It is becoming increasingly common.  As IT knowledge permeates the business world, it will become more common to give the three options to the business for collaboration.  If you are lucky, you are in an environment like that today.

For this step you need to take all business requirements and align them with a technical justification for a solution.  Producing three options may seem difficult.  Think of this like the old children’s story Goldilocks and the Three Bears.  One of these technical solutions will be too cold: too much latency, and not viable for 100% of business scenarios.  One will be too hot: this is the spare-no-expense, throw-everything-you’ve-got-at-it, as-close-to-five-nines-as-possible option.  One will be just right.  It will fit the business needs, not be cost prohibitive, and be maintainable.
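To see why the "too hot" option gets expensive, it helps to translate an availability percentage into the downtime it actually allows. The minimal Python sketch below does that arithmetic. It assumes a 365-day year and counts every minute of downtime; your SLA may define the measurement period differently, for example counting only business hours. The function name is a hypothetical helper for illustration.

```python
# Assumption: a 365-day year, ignoring leap years, with all downtime counted.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes_per_year(availability_percent: float) -> float:
    """Convert an availability target such as 99.9 into its yearly downtime budget."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for nines, pct in [(2, 99.0), (3, 99.9), (4, 99.99), (5, 99.999)]:
        minutes = allowed_downtime_minutes_per_year(pct)
        print(f"{nines} nines ({pct}%): about {minutes:,.1f} minutes of downtime per year")
```

Three nines leaves roughly 525.6 minutes (about 8.8 hours) of downtime per year, while five nines leaves only about 5.3 minutes, which is why each extra nine tends to drive up cost and complexity.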

Your natural first reaction will be, “Why three options?  Why not just propose the right one?”

Valid point, Dear Reader.  The reason is simple.  This is a high-stakes mental exercise.  The business, jobs, profitability, the security of our data, and many other things may rest on our solutions.  What makes a good IT person great?  Their mind.  Our ability to build virtual skyscrapers and complex virtual superhighways out of thin air.  There may be a natural reaction to jump at a ‘best’ quick solution.  Take your time and think outside the box.  You will find that if you push yourself to consider multiple options, one of them may supplant the ‘best’ quick solution.  While too cold and too hot may not work in this situation, they may next time.

We’re done, right?  Wrong. 

If we are changing an existing system, how are we going to get there?  We need to understand whether there are production outage windows we must avoid.  For example, in Retail or Manufacturing there are normally Brownout and Blackout windows for IT.  Leading up to Black Friday, a United States shopping holiday, most Retail environments have IT Blackouts.  This means no changes can be made in production unless they are to fix an existing bug.  Everything else must wait.  Leading up to a Blackout window there is a Brownout window.  During a Brownout no changes can be made to existing server infrastructure, which can limit the ability to allocate more storage from SANs, add networking equipment, or make networking changes in general.
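When you propose a migration date, it is worth checking it against the freeze calendar before it ever reaches the business. The minimal Python sketch below does that check. The window dates are entirely hypothetical placeholders (real ones come from your own change-management calendar), and the only structure taken from the text is that a Brownout window precedes a Blackout window. The sketch does not model the Blackout exception for fixing existing bugs.

```python
from datetime import date
from typing import Optional

# Hypothetical freeze calendar: (window name, first day, last day), inclusive.
# Replace with the real dates from your organization's change calendar.
FREEZE_WINDOWS = [
    ("Brownout", date(2017, 11, 1), date(2017, 11, 15)),
    ("Blackout", date(2017, 11, 16), date(2017, 12, 31)),
]


def freeze_window_for(change_date: date) -> Optional[str]:
    """Return the name of the freeze window that contains change_date, or None."""
    for name, start, end in FREEZE_WINDOWS:
        if start <= change_date <= end:
            return name
    return None


def can_schedule_migration(change_date: date) -> bool:
    """A planned migration is only schedulable outside every freeze window."""
    return freeze_window_for(change_date) is None


if __name__ == "__main__":
    for candidate in [date(2017, 10, 20), date(2017, 11, 10), date(2017, 11, 24)]:
        window = freeze_window_for(candidate)
        status = "OK to schedule" if window is None else f"inside {window} window"
        print(candidate, status)
```

Keeping the calendar as data rather than hard-coded checks makes it easy to share the same list with the application, server, and SAN teams you will be coordinating with.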

If we are performing an upgrade to use a new HA technology, when will that happen?  How do we migrate?  What are the steps?  Have we coordinated with the application development teams?  They will need to regression test their applications when a migration occurs.  This means changes in the development, QA/UAT, and eventually the production environments.  Additional coordination will be required with the server and SAN teams.  Depending on the size of your IT organization, this may be a large effort or simply a matter of shouting over your cubicle wall.


We’ve spoken with the business.  We understand their process.  We drew it up, passed it around, and found a few new things along the way.  We met as an IT team.  We produced tentative solutions.  We took the ideas to Management, and now we have a winner.  We’ve coordinated with our counterparts to make sure we know what a tentative timeline would look like.  Our next step?  Present the solution to the business.

After this meeting the goal is to move forward and get the project going.  We’ve put a lot of work into this, and you want to see it implemented.  Sometimes that happens.  Sometimes it doesn’t.  There are a lot of factors that can keep something from happening.  Because they can be legion, and they distract from our overall topic, I’m going to skip why things do not happen.  It is an intriguing topic that I may blog on later.

When we move forward, the goal is to do so in a collaborative way.  A good solution addresses business needs, technical requirements, project timelines, and budget.  There may be changes and wrinkles along the way.  What looked good on paper doesn’t always translate to technical viability.  It is important to maintain communication among key stakeholders throughout.

Alright Dear Reader, this has been a good kick off towards our topic.  If you have questions or comments please sound off below!  I look forward to hearing from you.  Next we will review a couple real-life examples of when HA was needed and when it wasn’t.  As always, Thank You for stopping by.