Tuesday, February 28, 2017

How Do I Add an R Package to SQL Server from In-Database R?

Hello Dear Reader!  I'm working on some really fun stuff with in-database R using SQL Server 2016.
I needed to add a couple of packages to my R engine so I could use them and call them from an R script. I realized I had not added any external packages to my R instance, so this was a perfect opportunity to write a blog!  Here is a quick walkthrough of how you add an R package to SQL Server.  This entire tutorial assumes the machine has internet access.  If you need to do this for a server, which probably doesn't have access to the internet, look at this MSDN article, Install Additional R Packages on SQL Server.


First we need to find our instance default install path for SQL Server.  In this case I'm a comic book geek and my named instances are JARVIS and STARK.  STARK is my 2016 instance so we'll need to find that path.

In this case my path is C:\Program Files\Microsoft SQL Server\MSSQL13.STARK\R_SERVICES\bin\x64.  I navigate there, hold down SHIFT, right click on Rgui.exe, and select Run as administrator.



After this opens we need to create a variable to hold the location where we would like the package to be installed.  Then we can install the package we want to use, in this case the plyr package.


# path to the library folder for my STARK named instance
lib.SQL <- "C:\\Program Files\\Microsoft SQL Server\\MSSQL13.STARK\\R_SERVICES\\library"
# install plyr into the instance library instead of the default user library
install.packages("plyr", lib=lib.SQL)

Notice I adjusted the path for my named instance.

After it installs our window should look like this.
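If you'd rather confirm the install from Management Studio than squint at the Rgui output, here is a quick sketch that lists every package the instance can see.  Nothing in it is specific to plyr; it just returns the installed package names as a result set.

execute sp_execute_external_script
@language=N'R'
,@script=N'
       OutputDataSet <- data.frame(PackageName = rownames(installed.packages()));'
WITH RESULT SETS ((PackageName nvarchar(255)));

If plyr shows up in that list, the install landed in the right library.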

Now I can run a simple test script loading the plyr package.

execute sp_execute_external_script
@language=N'R'
,@script =N'
       library(plyr);
       OutputDataSet<-InputDataSet'
,@input_data_1=N'select 1'
WITH RESULT SETS UNDEFINED;

If I get the following error, the package did not install correctly and could not be loaded:

Msg 39004, Level 16, State 20, Line 4
A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.
Msg 39019, Level 16, State 1, Line 4
An external script error occurred: 
Error in library(plyr) : there is no package called 'plyr'
Calls: source -> withVisible -> eval -> eval -> library

Error in ScaleR.  Check the output for more information.
Error in eval(expr, envir, enclos) : 
  Error in ScaleR.  Check the output for more information.
Calls: source -> withVisible -> eval -> eval -> .Call
Execution halted
Msg 11536, Level 16, State 1, Line 4
EXECUTE statement failed because its WITH RESULT SETS clause specified 1 result set(s), but the statement only sent 0 result set(s) at run time.
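When I have seen that error, the first thing worth checking is whether the package actually landed in a library the SQL Server R runtime searches, rather than a per-user library.  Here is a small troubleshooting sketch that returns the library paths the in-database R engine is looking at:

execute sp_execute_external_script
@language=N'R'
,@script=N'
       OutputDataSet <- data.frame(LibPath = .libPaths());'
WITH RESULT SETS ((LibPath nvarchar(500)));

The instance library from earlier should be in that list, and the package folder should exist underneath it.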

If the statement compiles and runs, everything is fine and I get a result set.



As always Thanks for stopping by.

Thanks,

Brad







Friday, February 3, 2017

The High Availability Conversation Part 1: Introduction

Hello Dear Reader!  Twitter is a fantastic tool if you use SQL Server.  It is a great way to network, find presentations, and interact with experts, and #SQLHelp offers some of the best technical forum conversation on the subject.  I was on Twitter and my friend Kenneth Fisher asked a question.




Kenneth’s question was interesting.  Let’s define a couple terms up front.  HA in this case is High Availability.  It is the capability for a database or data store to maintain availability and connectivity to a Graphical User Interface, Web Service, or some other data consuming application despite a localized outage or failure of the primary system.  

High Availability is different from Disaster Recovery, or DR.  Disaster Recovery is required when a disaster, natural or man-made, prevents the use of resources within a data center and the hardware within to support regular business processes.  In this blog, we will be addressing HA only as that was what Kenneth had asked about.

When asked for further explanation, Kenneth said he didn't need HA in his environment, more on this in a moment; he was looking for a well-rounded point of view.  I had a couple of replies and then started direct messaging, DM'ing, Kenneth to share my insight.  This is a conversation you must have with the business.

A keen understanding of how business logic translates into technology workflows requires both IT and business leaders making the right decisions together.  Over this series I'm going to give you a couple of examples of when HA was needed, and some in which it was not.  Later we will discuss different HA technologies and when each one is right for you.  First I will define some key things to cover in a High Availability Conversation between the business and IT.


THE HIGH AVAILABILITY CONVERSATION


There are three types of conversations: technical, business, and budget.  Business comes first.  Our first goal is to understand what the business is facing.  Some of this you may know in advance; if so, review the information anyway to make sure business and IT are communicating from the same page.
·       Describe the business process –
o   What product or service are we providing?
o   What type of employees participate in the process?
o   What does each employee need to get from the process?
·       Diagram a workflow of how a customer interaction occurs
·       Add the steps that are IT specific
·       Critical days, weeks, months, holidays, or times of the day for the specific business process
·       Understand any current pain points

First make sure you understand the business process surrounding any technical system.  Draw shapes on dry erase boards, connect them to the business process, and make sure you know who the business owner is for each technical system.  Force self-examination and business awareness.  Add your IT systems to the diagram.  Highlight business and IT dependencies.  Talk to the teams to understand what happens when a system goes down.

You can use Visio, PowerPoint, a dry erase board, or whatever you like.  Make a visual representation based on the findings from the meeting.  Confirm that everyone can agree on it, or use it to solicit their teams for additional vetting.  You may not get it right on the first run.  That is okay.  This could be an interview process depending on the complexity of your IT environment.  There are some standard technical terms that you should use:

·       SLA – Service Level Agreement.  This is the agreed expectation between the business and IT for overall system availability.  It may represent business hours.  In some cases, Service Providers have contractual SLAs; if your business has an SLA with customers there could be a financial penalty for not meeting it (the rough numbers after this list give a sense of scale).
·       RPO – Recovery Point Objective.  This defines the amount of data that can be lost on a system without causing impact or harm to the business.  It is the point to which all data or transactions must be recoverable.  The price of any solution increases as the RPO gets lower.
·       RTO – Recovery Time Objective.  This is the amount of time it takes to get a system back online in the event of an outage.  The time to recover will vary greatly with the size, volume, and type of data involved.  There are strategies and technical solutions that reduce recovery time, but their complexity adds to the overall cost.
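To make the availability numbers concrete, a little back-of-the-napkin math helps: a 99.9% availability target leaves room for roughly 8.8 hours of downtime over a year, 99.99% leaves under an hour, and "five 9's" (99.999%) leaves a little over 5 minutes.  Rough numbers, but they frame how quickly cost climbs as you chase the next 9.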

REFINING HA: USING THE THREE BEARS TECHNIQUE

After the first business meeting the technical team should meet to discuss their options.  I like to develop three options when working with IT Managers.  This is where we, the technical staff, align on a few things.

·       Proposed Technical Solution
·       Cost of Technical Solution
·       Cost to Maintain Technical Solution
·       Migration path from the existing platform to the new one

There are many ways to address HA.  Some are third-party vendor products you may already utilize, others are hardware or virtualization options, and others are database specific.  The IT architecture, the size of the corporation, and the business need for HA will all shape these decisions.

Notice I said three options for IT Managers.  When you meet with the business you should have a cohesive vision.  When I started my IT career, finding a manager with IT experience was extremely rare.  It is becoming more common.  As IT knowledge permeates the business world it will be more common to give the three options to the business for collaboration.  If you are lucky you are in an environment like that today.

For this step you need to take all the business requirements and align them to the technical justification for a solution.  Producing three options may seem difficult.  Think of this like the old children's story Goldilocks and the Three Bears.  One of these technical solutions will be too cold: too much latency, not viable for 100% of business scenarios.  One will be too hot: the spare-no-expense, throw-everything-you've-got-at-it, as-close-to-five-9's-as-possible option.  One will be just right: it will fit the business needs, not be cost prohibitive, and be a maintainable solution.

Your natural first reaction will be, "Why three options?  Why not just propose the right one?"

Valid point Dear Reader.  The reason is simple.  This is a high stakes mental exercise.  The business, jobs, profitability, security of our data, and many other things may rest on our solutions.  What makes a good IT person great?  Their mind.  Our ability to make virtual skyscrapers and complex virtual super highways out of thin air.  There may be a natural reaction to jump at a ‘best’ quick solution.  Take your time, think outside of the box.  You will find if you push yourself to consider multiple options, one may supplant the ‘best’ quick solution.  While too cold and too hot may not work in this situation, they may next time.

We’re done, right?  Wrong. 

If we are changing an existing system, how are we going to get there?  We need to understand if there are production outage windows we must avoid.  For example, in Retail or Manufacturing there are normally brownout and blackout windows for IT.  Leading up to Black Friday, a United States shopping holiday, most Retail environments have IT blackouts.  This means no changes can be made in production unless they fix an existing bug.  Everything else must wait.  Leading up to a blackout window there is a brownout window.  During a brownout no changes can be made to existing server infrastructure; this can limit the ability to allocate more storage from SANs, add networking equipment, or make networking changes in general.

If we are performing an upgrade to use a new HA technology, when will that happen?  How do we migrate?  What are the steps?  Have we coordinated with the application development teams?  They will need to regression test their applications when a migration occurs.  This means changes in the development, QA/UAT, and eventually the production environments.  Additional coordination will be required with the server and SAN teams.  Depending on the size of your IT organization this may be a large effort or simply the act of shouting over your cubicle wall.

 THE HA CONVERSATION II: THE SOLUTION & THE RESPONSE

We’ve spoken with the business.  We understand their process.  We drew it up, passed it around, and found a few new things along the way.  We met as an IT team.  We produced tentative solutions.  Took the ideas to Management and now we have a winner.  We’ve coordinated with our counterparts to make sure we knew what a tentative timeline would look like.  Our next step?  Present the solution to the business.
 
After this meeting the goal is to move forward and get the project going.  We've put a lot of work into this, and you want to see it implemented.  Sometimes that happens.  Sometimes it doesn't.  There are a lot of factors that can go into why something doesn't happen.  Because they can be legion, and it distracts from our overall topic, I'm going to skip why things do not happen.  It is an intriguing topic that I may blog on later.

When we move forward the goal is to do so in a collaborative way.  A good solution addresses business needs, technical requirements, project timelines, and budget.  There may be changes, and there may be wrinkles that occur along the way.  What looked good on paper doesn't always translate to technical viability.  It is important to maintain communication among the key stakeholders along the way.

Alright Dear Reader, this has been a good kickoff to our topic.  If you have questions or comments please sound off below!  I look forward to hearing from you.  Next we will review a couple of real-life examples of when HA was needed and when it wasn't.  As always, Thank You for stopping by.

Thanks,


Brad 

Monday, December 5, 2016

Optimize For Unknown

Hello Dear Reader!  It's been a while.  I was working with a friend and we came across an interesting problem: they had a large amount of skewness/data skew, and it led to some performance issues for them.  The way this manifested itself was in a set of queries that normally ran quickly, within seconds, but occasionally ran much longer.  To be precise, about 800 times longer.


As you can imagine this is a less than ideal situation for a production environment.  Their solution was to add OPTION (RECOMPILE) to all of their stored procedures.  This solved the issue with the data skew, but it caused additional side effects: increased CPU, since every stored procedure now had to be recompiled before execution; no stored procedure could benefit from plan reuse; and using DMVs to track stored procedure utilization and statistics no longer worked.
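To put that last point in context, this is the kind of DMV query I lean on to see how often a procedure's cached plan is being reused and what it is costing us.  Just a sketch; these views only report on plans sitting in the cache, so the more recompiling and non-caching going on, the less history they can show you.

select
       object_name(ps.object_id, ps.database_id) as procedureName
       ,ps.execution_count
       ,ps.total_worker_time
       ,ps.last_execution_time
from
       sys.dm_exec_procedure_stats ps
order by
       ps.total_worker_time desc;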


"So Balls", you say, "What is the alternative?  Is there an alternative?  And what in the name of King and Country is skewness/data skew!"

Ahh great question Dear Reader!  Dear Reader why are you English today?

"Because you watched the Imitation Game last night, and Benedict Cumberbatch's voice is still stuck in your head."

Right as always Dear Reader!  Good point let's explain it and then do a Demo!


SKEWNESS/DATA SKEW


Skewness is a term from statistics and probability theory that refers to the asymmetry of the probability distribution of a real-valued random variable about its mean.  This could get complicated quickly.  In simpler terms, it means the data is not evenly distributed: some values occur far more often than others, and the distribution deviates from what a normal distribution would predict.

How does this affect our query plans?  With data skew we have an overabundance of data for some values and very little for others.  That means the estimates the SQL Server Cardinality Estimator derives from the statistics can be very different from one parameter value to the next.

Here's a quick example.  I have a school with 100,000 students.  Every student has one of 10 different last names.  On average one could assume roughly 10,000 students share each last name.  If we randomly assign these values there will be a slight skewness, but most of the ranges will be similar.  For this example I'll use the students table from my college database.
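If you don't have a college database handy and want to follow along, here is a rough sketch of the schema I'm assuming.  The table and column names are inferred from the queries in this post, and I'm only populating students since that is where the skew lives; treat it as a starting point rather than the exact database I used.

create table dbo.students
(
       studentID int identity(1,1) primary key,
       FirstName varchar(50),
       LastName varchar(50)
);

create table dbo.courses
(
       courseid int identity(1,1) primary key,
       courseName varchar(100)
);

create table dbo.enrollment
(
       studentid int references dbo.students(studentID),
       courseid int references dbo.courses(courseid)
);

/* 100,000 students spread across 10 last names */
;with n as
(
       select top (100000)
              row_number() over(order by (select null)) as rn
       from sys.all_objects a cross join sys.all_objects b
)
insert into dbo.students (FirstName, LastName)
select
       concat('Student', rn)
       ,choose((rn % 10) + 1, 'Segarra','Smith','Jones','Brown','Davis',
               'Miller','Wilson','Moore','Taylor','Anderson')
from n;

With some data in place, a quick count by last name shows the distribution: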

       select
              lastname, count(*)
       from
              dbo.students
       group by lastname
       order by count(*) desc;



Now we move a new student to the area.  This one student will give us quite a bit of data skew and will be extremely asymmetrical compared to the other results.


In order to show this in action we'll make a stored procedure that returns the First Name, Last Name, and Course Name of students by last name.  Remember some students are enrolled in multiple courses.  This means we will get back more result rows than there are students with a given last name.


if exists(select name from sys.procedures where name='p_sel_get_stu_name')
begin
       drop procedure p_sel_get_stu_name
end
go
create procedure p_sel_get_stu_name(@lname varchar(50))
as
begin
      
       select
              s.FirstName
              ,s.LastName
              ,c.courseName
       from
              dbo.students s
              left join enrollment e
              on s.studentID=e.studentid
              left join courses c
              on e.courseid = c.courseid
       where
                     lastname=@lname

end

So now we will execute the procedure, compare the query plans, and benchmark the performance.
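To keep the benchmarking simple I just turn on the session statistics before each run and grab the actual execution plan in Management Studio; a minimal sketch:

set statistics time on;
set statistics io on;

That returns CPU time, elapsed time, and logical reads for every statement in the Messages tab.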

exec p_sel_get_stu_name 'Bradley' with recompile;



exec p_sel_get_stu_name 'Segarra' with recompile;



The first query took a little over a second and returned two rows.  The second query was sub-second and returned 13,843 rows.  Each execution plan was different: one was parallel, the other was serial.  That makes sense; the parallel plan returned over 13,000 rows, while the serial plan returned only 2.  The statistical variance is different, so the cardinality estimates gave us different plans.

Now let's make this bad.  I'm going to run the first query and second query again.  This time I'm removing the with recompile.

exec p_sel_get_stu_name 'Bradley';
exec p_sel_get_stu_name 'Segarra';

The first query did not change.   The second one did.


We used the cached plan.  Because of data skew we forced 13,843 rows through the serial execution plan.  The result was 8 minutes and 42 seconds instead of a sub-second query.


This is data skew.  We've shown that recompiling the query forces both to execute with their least cost plan.  Is there another option?  In this case we could use the query hint OPTIMIZE FOR UNKNOWN.

The benefit of OPTIMIZE FOR UNKNOWN is that we can remove the recompile.  Instead of sniffing the parameter value that happens to be passed in, the optimizer builds the plan from the average density in the column statistics, giving us a single cached plan sized for the typical distribution of the data.
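Since the hint works off the overall density rather than a specific value, you can peek at the number the optimizer will use.  A sketch, assuming a statistics object on lastname named IX_students_lastname; substitute the name of whatever index or statistics you actually have on that column:

dbcc show_statistics ('dbo.students', IX_students_lastname) with density_vector;

The 'All density' value for lastname multiplied by the table's row count is roughly the row estimate the plan will be built around.  With that in mind, here is the procedure recreated with the hint: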


if exists(select name from sys.procedures where name='p_sel_get_stu_name')
begin
       drop procedure p_sel_get_stu_name
end
go
create procedure p_sel_get_stu_name(@lname varchar(50))
as
begin
      
       select
              s.FirstName
              ,s.LastName
              ,c.courseName
       from
              dbo.students s
              left join enrollment e
              on s.studentID=e.studentid
              left join courses c
              on e.courseid = c.courseid
       where
                     lastname=@lname
       option (optimize for unknown)
end

exec p_sel_get_stu_name 'Bradley';
exec p_sel_get_stu_name 'Segarra';

Now we execute our procedures and we get our execution plans.  Here are our new query plans.




You'll notice that the plan chosen, based on the overall statistics rather than the specific parameter, was the parallel plan.  Both queries executed sub-second.  This is not the least cost plan for the first query.  In case you were curious, here is a look at the histogram.
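If you want to pull the histogram up yourself, DBCC SHOW_STATISTICS will display it.  Again this assumes a statistics object on lastname named IX_students_lastname; use whatever name your index or statistics actually has:

dbcc show_statistics ('dbo.students', IX_students_lastname) with histogram;

Each step shows a high key value and how many rows fall on and just below it, which is where the skew between 'Bradley' and 'Segarra' shows up.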


WRAP IT UP

So what does this mean?  For the business purpose of speeding up a query, OPTION (RECOMPILE) is completely valid.

It comes at a cost: recompilations, increased CPU utilization, and you lose the execution history of the stored procedure in the DMVs.

If those costs do not affect you, or affect the system less than the fluctuation in query performance, then it is a valid approach.

There is also another alternative to keep in your tool belt, and that is what we used today.  Like all things in computers, use it judiciously.  Test, test, and retest before deploying into production.  As always Dear Reader, Thanks for stopping by.


Thanks,
Brad