Friday, February 3, 2017

The High Availability Conversation Part 1: Introduction

Hello Dear Reader!  Twitter is a fantastic tool if you use SQL Server.  It is a great way to network, find presentations, and interact with experts, and #SQLHelp offers some of the best technical forum conversation on the subject.  I was on Twitter and my friend Kenneth Fisher asked a question.




Kenneth’s question was interesting.  Let’s define a couple of terms up front.  HA in this case is High Availability: the capability of a database or data store to maintain availability and connectivity to a graphical user interface, web service, or some other data-consuming application despite a localized outage or failure of the primary system.

High Availability is different from Disaster Recovery, or DR.  Disaster Recovery is required when a disaster, natural or man-made, prevents the use of a data center and the hardware within it to support regular business processes.  In this blog we will address HA only, as that is what Kenneth asked about.

When asked for further explanation, Kenneth said he didn’t need HA in his environment, more on this in a moment; he was looking for a well-rounded point of view.  I had a couple of replies and then started direct messaging, DM’ing, Kenneth to share my insight.  This is a conversation you must have with the business.

A keen understanding of how business logic translates into technology workflows requires IT and business leaders making the right decisions together.  Over this series I’m going to give you a couple of examples of when HA was needed, and some in which it was not.  Later we will discuss different HA technologies and when each one is right for you.  First, I will define some key things to cover in a High Availability Conversation between the business and IT.


THE HIGH AVAILABILITY CONVERSATION


There are three types of conversations: technical, business, and budget.   Business comes first.  Our first goal is to understand what the business is facing.  Some of this you may know in advance; if so, review the information to ensure the business and IT are communicating.
·       Describe the business process –
o   What product or service are we providing?
o   What type of employees participate in the process?
o   What does each employee need to get from the process?
·       Diagram a workflow of how a customer interaction occurs
·       Add the steps that are IT specific
·       Critical days, weeks, months, holidays, or times of the day for the specific business process
·       Understand any current pain points

First, make sure you understand the business process surrounding any technical system.  Draw shapes on dry erase boards, connect them to business processes, and make sure you understand who the business owner is for each technical system.  Force self-examination and business awareness.  Add your IT systems to the diagram.  Highlight business and IT dependencies.  Talk to teams to understand what happens when a system goes down.

You can use Visio, PowerPoint, a dry erase board, or whatever you like.  Make a visual representation based on the findings from the meeting.  Confirm that everyone can agree on it, or use it to solicit their teams for additional vetting.  You may not get it right on the first run.  That is okay.  This could be an iterative interview process depending on the complexity of your IT environment.  There are some standard technical terms that you should use:

·       SLA – Service Level Agreement.  This is the agreed expectation between the business and IT for overall system availability.  It may represent business hours.  In some cases, service providers have contractual SLAs.  If your business has an SLA with customers there could be a financial penalty for not meeting it.
·       RPO – Recovery Point Objective.  This defines the amount of data that can be lost on a system without causing impact or harm to the business: the point to which all data or transactions must be recoverable.  The price of any solution increases as the RPO gets lower.  A quick way to check your current exposure is sketched after this list.
·       RTO – Recovery Time Objective.  This is the amount of time it takes to get a system back online in the event of an outage.  The time to recover will vary greatly with the size, volume, and type of data involved.  There are strategies and technical solutions that reduce recovery time, but their complexity adds to the overall cost.
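
As a rough, minimal sketch of how RPO becomes something measurable: if a database relies on transaction log backups, the time since the last log backup is approximately the data you stand to lose right now.  The query below only uses the standard system views sys.databases and msdb.dbo.backupset; calling the result "exposure" is my own framing, and it assumes log backup history is recorded in msdb on this instance.

-- Rough "current RPO exposure": minutes since each database's last log backup.
-- Assumes FULL recovery model and that backup history is recorded in msdb.
select
       d.name                                                     as database_name,
       max(b.backup_finish_date)                                  as last_log_backup,
       datediff(minute, max(b.backup_finish_date), sysdatetime()) as minutes_of_exposure
from sys.databases as d
left join msdb.dbo.backupset as b
       on b.database_name = d.name
      and b.type = 'L'              -- 'L' = transaction log backup
where d.database_id > 4             -- skip the system databases
group by d.name
order by minutes_of_exposure desc;

Compare that number to the RPO the business says it needs; if the gap is larger than the agreed RPO, the backup schedule, not the HA technology, is the first thing to fix.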

REFINING HA: USING THE THREE BEARS TECHNIQUE

After the first business meeting the technical team should meet to discuss their options.  I like to develop three options when working with IT Managers.  This is where we, the technical staff, align a few things.

·       Proposed Technical Solution
·       Cost of Technical Solution
·       Cost to Maintain Technical Solution
·       Migration from the existing platform to the new one

There are many ways to address HA.  Some solutions use third-party vendor products that you may already utilize, others are hardware or virtualization options, and others are database specific.  The IT architecture, the size of the corporation, and the business need for HA will shape these decisions.

Notice I said three options for IT Managers.  When you meet with the business you should have a cohesive vision.  When I started my IT career, finding a manager with IT experience was extremely rare.  It is becoming more common.  As IT knowledge permeates the business world, it will become more common to give the three options to the business for collaboration.  If you are lucky, you are in an environment like that today.

For this step you need to take all the business requirements and align them with the technical justification for a solution.  Producing three options may seem difficult.  Think of this like the old children’s story Goldilocks and the Three Bears.  One of these technical solutions will be too cold: too much latency, and not viable for 100% of business scenarios.  One will be too hot: the spare-no-expense, throw-everything-you’ve-got-at-it, as-close-to-five-nines-as-possible option.  One will be just right.  It will fit the business needs, not be cost prohibitive, and be a maintainable solution.

Your natural first reaction will be, “Why three options?  Why not just propose the right one?”

Valid point, Dear Reader.  The reason is simple.  This is a high-stakes mental exercise.  The business, jobs, profitability, the security of our data, and many other things may rest on our solutions.  What makes a good IT person great?  Their mind: our ability to build virtual skyscrapers and complex virtual superhighways out of thin air.  There may be a natural urge to jump at a ‘best’ quick solution.  Take your time and think outside of the box.  You will find that if you push yourself to consider multiple options, one may supplant the ‘best’ quick solution.  While too cold and too hot may not work in this situation, they may next time.

We’re done, right?  Wrong. 

If we are changing an existing system, how are we going to get there?  We need to understand if there are production outage windows we must avoid.  For example, in Retail or Manufacturing there are normally brownout and blackout windows for IT.  Leading up to Black Friday, a United States shopping holiday, most Retail environments have IT blackouts.  This means no changes can be made in production unless they fix an existing bug.  Everything else must wait.  Leading up to a blackout window there is a brownout window.  During a brownout no changes can be made to existing server infrastructure, which can limit the ability to allocate more storage from SANs, add networking equipment, or make networking changes in general.

If we are performing an upgrade to use a new HA technology, when will that happen?  How do we migrate?  What are the steps?  Have we coordinated with the application development teams?  They will need to regression test their applications when a migration occurs.  This means changes in the development, QA/UAT, and eventually the production environments.  Additional coordination will be required with the server and SAN teams.  Depending on the size of your IT organization this may be a large effort or the act of shouting over your cubicle wall.

 THE HA CONVERSATION II: THE SOLUTION & THE RESPONSE

We’ve spoken with the business.  We understand their process.  We drew it up, passed it around, and found a few new things along the way.  We met as an IT team.  We produced tentative solutions.  Took the ideas to Management and now we have a winner.  We’ve coordinated with our counterparts to make sure we knew what a tentative timeline would look like.  Our next step?  Present the solution to the business.
 
After this meeting the goal is to move forward and get the project going.  We’ve put a lot of work into this, and you want to see it implemented.  Sometimes that happens.  Sometimes it doesn’t.  There are a lot of factors that can go into why something doesn’t happen.  Because they can be legion, and it distracts from our overall topic, I’m going to skip why things do not happen.  It is an intriguing topic that I may blog on later.

When we move forward, the goal is to do so in a collaborative way.  A good solution addresses business needs, technical requirements, project timelines, and budget.  There may be changes, and there may be wrinkles that occur along the way.  What looks good on paper doesn’t always translate into technical viability.  It is important to maintain communication among key stakeholders along the way.

Alright Dear Reader, this has been a good kickoff for our topic.  If you have questions or comments please sound off below!  I look forward to hearing from you.  Next we will review a couple of real-life examples of when HA was needed and when it wasn’t.  As always, Thank You for stopping by.

Thanks,


Brad 

Monday, December 5, 2016

Optimize For Unknown

Hello Dear Reader!  It's been a while.  I was working with a friend and we came across an interesting problem.  They had a large amount of skewness/data skew, which led to some performance issues for them.  It manifested itself in a set of queries that normally ran quickly, within seconds.  Then occasionally they ran much longer.  To be precise, they ran about 800 times longer.


As you can imagine, this is a less than ideal situation for a production environment.  Their solution was to add OPTION (RECOMPILE) to all of their stored procedures.  This solved the issue with their data skew, but it caused additional side effects: increased CPU, because every stored procedure now had to recompile before execution; no stored procedure could experience plan reuse; and using DMVs to track stored procedure utilization and statistics no longer worked.
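
To make that last point concrete, here is a minimal sketch of the kind of DMV check that stops being useful: sys.dm_exec_procedure_stats only reports on plans that stay in the cache, so procedures that recompile on every execution leave little or no history behind.  The query itself is my illustration, not something from the friend's environment.

-- Procedure-level execution history; only plans that remain cached show up here.
select
       db_name(ps.database_id)                   as database_name,
       object_name(ps.object_id, ps.database_id) as procedure_name,
       ps.execution_count,
       ps.cached_time,
       ps.last_execution_time,
       ps.total_worker_time / 1000               as total_cpu_ms
from sys.dm_exec_procedure_stats as ps
order by ps.execution_count desc;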


"So Balls", you say, "What is the alternative?  Is there an alternative?  And what in the name of King and Country is skewness/data skew!"

Ahh great question Dear Reader!  Dear Reader why are you English today?

"Because you watched the Imitation Game last night, and Benedict Cumberbatch's voice is still stuck in your head."

Right as always Dear Reader!  Good point let's explain it and then do a Demo!


SKEWNESS/DATA SKEW


Skewness is a term from statistics and probability theory that refers to the asymmetry of the probability distribution of a real-valued random variable about its mean.  This could get complicated quickly.  In simpler terms that I can understand, it means the data is not spread evenly around the average: some values occur far more often than others, and skewness measures how far the distribution departs from the normal.

How does this affect our query plans?  With data skew we have an overabundance of data that fits one statistical model and does not fit others.  This means the estimates the SQL Server Cardinality Estimator produces for one value may be very different from those for another, based on the statistics.

Here's a quick example.  I have a school with 100,000 students.  Every student has one of 10 different last names.  On average one could assume roughly 10,000 students per last name.  If we randomly assign these values there will be a slight skewness, but most of the ranges will be similar.  For this example I'll use the students table from my college database.

       select
              lastname, count(*)
       from
              dbo.students
       group by lastname
       order by count(*) desc;



Now we move a new student to the area.  This one student will give us quite a bit of data skew and will be extremely asymmetrical to the other results.
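
Purely as a hypothetical sketch of that setup: the insert below adds one student with a rare last name and a couple of enrollments.  The FirstName and LastName columns come from the procedure further down; the identity column, the dbo schema on enrollment and courses, and the choice of two courses are my assumptions for illustration.

-- Hypothetical: one student with a rare last name to create the skewed value.
insert into dbo.students (FirstName, LastName)
values ('Brad', 'Bradley');

-- Two enrollments for that student, so a lookup on the rare name returns two rows.
insert into dbo.enrollment (studentid, courseid)
select s.studentID, c.courseid
from dbo.students as s
cross join (select top (2) courseid from dbo.courses) as c
where s.LastName = 'Bradley';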


In order to show this in action we'll make a stored procedure that returns our First Name, Last Name, and the Course Name of students by last name.  Remember some students will have multiple courses.  This means we will have more results than we do last names.


-- Drop and recreate the demo procedure: return student names and the courses
-- they are enrolled in, filtered by last name.
if exists(select name from sys.procedures where name='p_sel_get_stu_name')
begin
       drop procedure p_sel_get_stu_name
end
go
create procedure p_sel_get_stu_name(@lname varchar(50))
as
begin

       select
              s.FirstName
              ,s.LastName
              ,c.courseName
       from
              dbo.students s
              left join enrollment e
              on s.studentID=e.studentid
              left join courses c
              on e.courseid = c.courseid
       where
              s.lastname=@lname

end

So now we will execute this procedure, compare the query plans, and benchmark the performance.

exec p_sel_get_stu_name 'Bradley' with recompile;



exec p_sel_get_stu_name 'Segarra' with recompile;
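
If you want to capture the timings discussed below yourself, one simple way (my suggestion, not part of the original demo) is to wrap the two calls in SET STATISTICS TIME and IO, which print CPU time, elapsed time, and reads to the Messages tab:

set statistics time on;
set statistics io on;

exec p_sel_get_stu_name 'Bradley' with recompile;
exec p_sel_get_stu_name 'Segarra' with recompile;

set statistics time off;
set statistics io off;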



The first query took a little over a second to return two rows.  The second query was sub-second and returned 13,843 rows.  Each execution plan was different: one was parallel, the other was serial.  That makes sense.  The parallel plan returned over 13,000 rows; the serial plan returned only two.  The statistical variance is different, so the cardinality estimates gave us different plans.

Now let's make this bad.  I'm going to run the first query and second query again.  This time I'm removing the with recompile.

exec p_sel_get_stu_name 'Bradley';
exec p_sel_get_stu_name 'Segarra';

The first query did not change.   The second one did.


We used the cached plan.  Because of data skew we forced 13,843 rows through the serial execution plan.  The result was 8 minutes and 42 seconds instead of a sub-second query.


This is data skew.  We've shown that recompiling forces each execution to use its own least-cost plan.  Is there another option?  In this case we could use the query hint OPTIMIZE FOR UNKNOWN.

The benefit of OPTIMIZE FOR UNKNOWN is that we can remove the recompile and keep plan reuse.  Instead of sniffing the specific parameter value, the optimizer builds the plan from the average density in the statistics, so one general-purpose plan is chosen regardless of the skew in the value passed in.


-- Recreate the procedure; the only change is the OPTIMIZE FOR UNKNOWN hint,
-- which builds the plan from average density rather than the sniffed parameter.
if exists(select name from sys.procedures where name='p_sel_get_stu_name')
begin
       drop procedure p_sel_get_stu_name
end
go
create procedure p_sel_get_stu_name(@lname varchar(50))
as
begin

       select
              s.FirstName
              ,s.LastName
              ,c.courseName
       from
              dbo.students s
              left join enrollment e
              on s.studentID=e.studentid
              left join courses c
              on e.courseid = c.courseid
       where
              s.lastname=@lname
       option (optimize for unknown)
end

exec p_sel_get_stu_name 'Bradley';
exec p_sel_get_stu_name 'Segarra';

Now we execute our procedures and we get our execution plans.  Here are our new query plans.




You'll notice that the plan chosen from the average of the statistics was the parallel plan.  Both queries executed sub-second.  This is not the least-cost plan for the first query.  In case you were curious, here is a look at the histogram.
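
If you would like to pull up the histogram on your own copy, here is a minimal sketch.  The statistics name is a placeholder; point it at whatever index or auto-created statistic covers the lastname column in your database.

-- 'IX_students_lastname' is a placeholder statistics name.
dbcc show_statistics ('dbo.students', 'IX_students_lastname') with histogram;

-- On SQL Server 2016 SP1 CU2 and later the histogram is also exposed as a DMF.
select s.name as stats_name, h.range_high_key, h.equal_rows, h.range_rows
from sys.stats as s
cross apply sys.dm_db_stats_histogram(s.object_id, s.stats_id) as h
where s.object_id = object_id('dbo.students');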


WRAP IT UP

So what does this mean?  For the business purpose of speeding up a query, OPTION (RECOMPILE) is completely valid.

It comes at a cost: recompilations, increased CPU utilization, and you lose the history of the stored procedure's execution in the DMVs.

If those costs do not affect you, or affect the system less than the fluctuation in query performance, then it is a valid choice.

There is also another alternative to keep in your tool belt, and that is what we used today.  Like all things in computers, use it judiciously.  Test, test, and retest before deploying into production.  As always Dear Reader, Thanks for stopping by.


Thanks,
Brad 

Monday, August 29, 2016

SQL Azure, Hekaton, and Bob Dorr is Awesome

Hello Dear Reader!  I recently wrote a blog for the PFE blog, "How Many Tables Can I Have In SQL Azure DB & SQL Server".   It was a fun post that I wrote during a very long layover while traveling home.  It covers the maximum number of objects that you can create in SQL Server.


In that blog I wrote up a demo to create the maximum number of tables you can have in a database.  I also explained that, based on table creation rates, it would take around 64 days for me to reach that limit.  As a point of interest I tried creating In-Memory OLTP/Hekaton tables vs. native T-SQL tables.  There I noticed something very interesting: creating In-Memory OLTP tables was slower than creating T-SQL tables.  A lot slower.

To be crystal clear, this is specifically about the creation of tables, nothing more.  It has nothing to do with inserting, deleting, or reading data.  My blog was about how many OBJECTS a database could hold, and I had a tight loop creating tables over and over again.  I explored different options to get as many tables created as quickly as possible, along the lines of the sketch below.
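
Here is a scaled-down sketch of that kind of tight loop; the table names, the loop count of 100, and the timing wrapper are mine rather than the original test, and the second loop assumes the database already has a MEMORY_OPTIMIZED_DATA filegroup.

-- Loop 1: native (disk-based) tables.
declare @i int = 1, @sql nvarchar(max), @start datetime2 = sysdatetime();
while @i <= 100
begin
       set @sql = N'create table dbo.disk_tbl_' + cast(@i as nvarchar(10)) +
                  N' (id int not null primary key);';
       exec sys.sp_executesql @sql;
       set @i += 1;
end;
select datediff(millisecond, @start, sysdatetime()) as native_tsql_ms;

-- Loop 2: memory-optimized tables.  Each create also generates and compiles
-- the table's access DLL, which is where the extra per-table cost comes from.
select @i = 1, @start = sysdatetime();
while @i <= 100
begin
       set @sql = N'create table dbo.imoltp_tbl_' + cast(@i as nvarchar(10)) +
                  N' (id int not null primary key nonclustered)
                     with (memory_optimized = on, durability = schema_and_data);';
       exec sys.sp_executesql @sql;
       set @i += 1;
end;
select datediff(millisecond, @start, sysdatetime()) as in_memory_oltp_ms;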

"So Balls", you say, "You said this was slow.  How slow?"

Great point Dear Reader, back to the question!

I was able to create 388 native T-SQL tables per second, but only 3 In-Memory OLTP tables per second.  On the book of faces someone asked about the results very quickly.



I reached out to my colleague Robert Dorr (Blog) from the SQL Team.  He and Bob Ward (@BobWardMS | Blog), from the SQL Tiger Team, run a great blog called Bob SQL, https://blogs.msdn.microsoft.com/bobsql/; you should check it out.

He was kind enough to write up a reply that covered how things work under the covers.  To read his blog click here.


QUICK SUMMARY
Each In-Memory table is managed, compiled code.  Making In-Memory objects compiled saves CPU cycles and allows us to skip some of the traditional overhead of native T-SQL objects.

When we issue the DDL to create an In-Memory table we follow the same process as for a native T-SQL table.  We also do two additional steps: we create the access DDL code and we compile the access DDL code.

These steps are a one-time hit at object creation that gives us long-term benefits as we execute against our In-Memory OLTP/Hekaton tables.
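
If you want to see those compiled modules on your own instance, here is a small sketch using the documented sys.dm_os_loaded_modules DMV; the description filter is the value the engine uses for natively compiled In-Memory OLTP modules.

-- List the natively compiled DLLs that In-Memory OLTP has generated and loaded.
select name, description
from sys.dm_os_loaded_modules
where description = 'XTP Native DLL';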

WRAP UP



There is nothing but great questions out there, Dear Reader.  The worst kind of question is the one that is not asked.  In this case the question pushed me to look for an answer that I already wanted to know.


In this case, a demo that I almost didn't write and a question from a place where I normally do not look for questions were all that we needed to get an answer.  My thanks again to Robert Dorr for taking the time to give us the answer.



I love it when a plan comes together.  Until next time Thanks for stopping by.

Thanks,

Bradley Ball