Wednesday, August 31, 2016

Due Diligence In Selecting Technology

Information Technology is not a religion.  Your buzzwords are not a mantra.  When a developer is confronted with a new technology, s/he must evaluate it and weigh its value against the learning curve, not just jump on the bandwagon.  Even worse is the management type who decides he is going down a technological path because he met the Tableau salesman on the golf course.

At Sentia we take the technology selection process VERY seriously.  We choose Microsoft not because we are told to or because we are familiar with it; we choose Microsoft because it is the ONLY vendor who provides a complete solution from database to desktop to web.  We choose Windows not only because it is the Microsoft operating system, but because it doesn't require a six-figure administrator to configure and make work.  We choose SQL Server because it is faster than Oracle (the only other real choice) on Windows. We choose .NET because it has a well-documented, outstanding Integrated Development Environment (IDE) in Visual Studio, and it doesn't need a ton of other tools to develop applications. Gradle. Eclipse.  IntelliJ.  YourKit.  Clover.  The list goes on ad nauseam.

We are so adamant about choosing the right tool for the job that we didn't even develop web applications until 2009, when Microsoft introduced its version of the Model/View/Controller (MVC) pattern, which abandoned the pretense of statefulness across the internet and gave us a way to build web applications that made sense.  At the time it didn't make sense to jump ship to Apache/PHP on Unix/Linux (named for Linus Torvalds) just to get web technology that made sense.  I might make a different decision today.

Sure, Microsoft has made some grabs at the money tree with SharePoint and BizTalk.  These applications are designed to make a complicated task seem easier than it is.  SharePoint is nothing more than FrontPage (a WYSIWYG web page generating tool) along with document management.  You already have a powerful tool to develop webpages and web applications in Visual Studio (the IDE we talked about above), and you don't need documents.  No, really.  You don't need documents.  An attorney MIGHT need a signature, but the contents of the document should be put in the database so it can be indexed and searched.  SharePoint is the electronic version of a file cabinet, and the people who advocate, develop for, and use it should be fired.  BizTalk is more of the same.  It allows developers to use disparate data sources as if they were native.  We already have a way to use any data source as native with SQL Server Linked Servers.  In fact, Linked Servers are the basis for Sentia's Information Integrator, an ETL tool that is so easy to use we think anyone can do it.
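For readers who haven't used Linked Servers, here is a minimal sketch; the server, catalog, and table names are invented for illustration.  Once the remote source is registered, you query it with a four-part name as if it were a local table.

-- Register a remote SQL Server instance as a linked server (names are hypothetical)
EXEC sp_addlinkedserver
    @server     = N'REMOTE_ERP',
    @srvproduct = N'',
    @provider   = N'SQLNCLI',
    @datasrc    = N'erp-server-01';

-- Then join it to local tables with a four-part name, as if it were native
SELECT c.CustomerID, c.Name, o.OrderTotal
FROM   dbo.Customers AS c
JOIN   REMOTE_ERP.ErpDb.dbo.Orders AS o ON o.CustomerID = c.CustomerID;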

Yes, we also know that we have to use JavaScript, or can use jQuery or AngularJS or some other plugin, to make our webpages look pretty and add functionality.  Microsoft was so dead set on ASP and ASP.NET that they attempted to force everyone to use their technology.  Now, however, with the advent of MVC and NuGet, we can use these plugins and maintain versioning on them (there seems to be a new version every week) easily with Microsoft tools.

What do we get for this loyalty?  We get a single group of developers who can do everything that needs to be done, from start to finish, without having to learn the latest flavor of some dumb Java tool, or write a servlet to make Apache work, or know nothing about database design and development because they have never seen one.  We get to say 'This is the way we do things,' and if that way changes, we all change together.  We get to build tools that automate the development of software because we all agree that 'this is what we do, and this is the way we do it.'  We don't have some new guy saying 'Gee, I know this and I use that.'  We have the old guys saying 'This is the best way, and we know because we spent the time and did the due diligence to make sure that it is.'

The conclusion is that with this methodology, and the tools we have developed to generate about 80% of the code we deploy, we always do the same thing, the same way every time.  This makes the code base easy to read, intuitive and simple to maintain.  That doesn't even count the elimination of 80% of the development time and expense for our clients.

Really, if you have anyone else develop your applications, you are making a mistake.  If your project is done differently, it is done wrongly, and you will waste your time, money and effort.  Even better, if we can automate what we do, we can automate what you do.  We have clients who used to run themselves ragged 80 or 100 hours per week, with dozens of employees emailing spreadsheets and working their fingers to the bone, who now only click a button to generate bills at the end of the month.  80-100 hours to 35 seconds, and dozens of expensive employees to none.  It isn't religion, it is efficiency.  That is what we do.

http://sentiasystems.com

Friday, August 26, 2016

Object Relational Mappers: Why and Why NOT Use Them

As developers, we find that the database is king; it is the foundation on which all other software is written.  The problem with that is that nobody starts out as a database developer.  In our quest to become the perfect developer, almost all of us neglect the first and most important step: learning about databases.  First, we find ourselves struggling to wrap our minds around set-based programming (remember trying to find a quarter-sized rock using tweezers and a magnifying glass instead of two boxes, one with holes slightly larger than a quarter and the other with holes slightly smaller?).  Second, we have to pick some type of data access language and understand that whatever we pick probably has memory leaks and definitely has a ton of overhead.  If you live in Microsoft's world you have to remember to close your ADO.NET connection and dispose of your objects when you are done with them.  Third, we have to decide how to model the middle-tier objects and what logic goes where.  There is little wonder that most new software projects (I've heard as high as 90%) never see production.

So there has to be a better way, right?  Surely someone has written an application that does all this grunt work for you.  Enter the Object Relational Mapper.  Wikipedia states:

"Object-relational mapping (ORM, O/RM, and O/R mapping tool) in computer science is a programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a "virtual object database" that can be used from within the programming language. There are both free and commercial packages available that perform object-relational mapping, although some programmers opt to construct their own ORM tools."

"That sounds great!" you say.

"I want one!" you exclaim.

"Which one do I use?" you ask.

Aye, but there's the rub.  Honestly, if you don't understand what all this automagic stuff is doing, you probably shouldn't use it.  First, I eschew using anything that doesn't allow me to pull back the curtain and see how it works.  Microsoft provides Entity Framework, which installs all kinds of visual geegaws to access your data, and all kinds of code with the warning (and I paraphrase) 'You don't know what this does, so don't mess with it.'  If that particular code doesn't do what I think it should do, I want to know why, and I simply can't.  All of the ORMs I have used and heard of suffer from this problem.

Another problem is that you don't know what kind of set-based code is getting run on your server.  One of the biggest problems with any application is "it's running slowly," and everyone looks at the database.   If you can't see the code running, you certainly can't troubleshoot it.

Best database design practice is to lock the users out of the table structure and do all data access through stored procedures, so we can apply logic, security and auditing.  Unless your ORM writes procedures (and it doesn't), you have to write them yourself.  That kind of makes the ORM useless as a code generation tool, since you have to write your own code anyway.  Even worse, all the set-based code I've seen ORMs produce is horrible.  Venture one little pinky toe outside of simple Create/Read/Update/Delete (CRUD) operations (and you will; you have to do searches at a minimum), and the set-based code these ORM tools pass to the server gets hairy fast.

We, here at Sentia, wrote our own ORM, and we call it the Object Relational Mapper (of course).  We think of it less as an ORM and more as a code generation tool that we use to produce the same code that we would have written by hand anyway.  Sentia's ORM produces all the stored procedures for CRUD and provides a Search capability as well.  More stored procedures are generated to access related data.  If we want all the blue cars, we do a search where car.ColorID = 36 (36 is blue, and we can search for that as well), but we can also search across many-to-many relationships.  If we want to get a student's class schedule, it might look like this: Class schedule = new Class(student.StudentID, GetClassesByForeignKey.ByStudentID).  That might be the very first line of code we have written in this application; it uses stored procedures so we can be secure, and it is backed by code that a good developer might write, not some black box that has a skull and crossbones warning on it.
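To make that concrete, here is a rough sketch of the kind of procedures a generator like this emits; the Car table and the naming convention are invented for illustration, not the actual generated code.

-- Generated: fetch one row by primary key
CREATE PROCEDURE dbo.Car_GetByID
    @CarID INT
AS
BEGIN
    SET NOCOUNT ON;
    SELECT CarID, Make, Model, ColorID
    FROM   dbo.Car
    WHERE  CarID = @CarID;
END;
GO

-- Generated: search on an optional foreign key (all the blue cars: @ColorID = 36)
CREATE PROCEDURE dbo.Car_Search
    @ColorID INT = NULL
AS
BEGIN
    SET NOCOUNT ON;
    SELECT CarID, Make, Model, ColorID
    FROM   dbo.Car
    WHERE  @ColorID IS NULL OR ColorID = @ColorID;
END;
GO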

There are several things we've had to do over the years to make the code produced by Sentia's ORM more usable.  We put all the generated code in a 'Generated Code' folder so we can tell what was produced and what was hand written.  All the generated code is put into partial classes so that we can extend their functionality without having to come up with kludgy new names.  For example, we are working on an application where the user can sign up for the service.  All calls to the database require a username and password that a new user won't have, so we had to extend the Professionals class with a new method, CreateNew, that takes a new professional object with its properties set and doesn't require a username and password.  This is all seamless to the downstream developer, of course.  S/he would expect to see Professionals.Add or Professionals.Delete in the IntelliSense menu, and now s/he can also see CreateNew(Professional).

What is the conclusion?   We decided that Sentia's ORM should write the code that we would write if we were absolutely perfect and did everything that we were supposed to do.  In many ways, generating code like that is the ONLY way to get it.  Otherwise, using nHibernate or Entity Framework or whatever other half-assed thing you need to get the job done is OK, if you are writing a movie list application for your mother to access her collection online, or you are a junior developer who can't get the job done any other way.  If you are that junior developer, know that you are going to have scalability and security problems.  Even worse, you can't hand-write the code that Sentia's ORM generates.  Neither can we.  Even we end up cutting corners sometimes, and it takes us weeks or months to do by hand the same work completed in seconds with Sentia's ORM.

What should you do, dear reader?  You should either come to work for us, or give up and have us write your application for you.  Heck, if you sell it for the price you think you can produce it for, call us and we'll do it for half. 

It ain't arrogant if it is true.

Tuesday, August 23, 2016

If You Are Using Email for Workflow, Excel or QuickBooks At All, You Are Wasting Time and Money

That statement probably just cheesed a whole lot of you off, and the sad part is that it is 100% true.  Here is what you are doing now: you get some kind of information over email from someone in your organization.  You do a little number crunching in Excel and attach the spreadsheet to an email that you then send to accounting.  In accounting, the spreadsheet is opened, maybe some more number crunching goes on, the data gets entered into QuickBooks, and maybe a check is issued or a report generated. Sound about right?  This is absolutely wrong, and it's not just the top three offenders here, it is every separate application you type anything into.
Don't feel bad; that's the way everyone, from Bank of America on down, does it.   There is a better way.  What if the person who generated the data originally could type it into a custom-built application that does exactly what you do, for you?  Let's take an example.  You have a guy on the loading dock receiving shipments.  Usually, s/he will count the number of boxes, sign a bill of lading and hand carry a copy to Accounts Payable, who types the items into QuickBooks and then issues a check.  What if, instead, the person at the loading dock scanned each box as it came in, and this hypothetical piece of software recorded it, counted it, added it to inventory, issued a check to pay for it and put it in the ledger all in one fell swoop?  What if the warehouse person scanned the box and the barcode at the location in the warehouse where the box is going to live until the contents are needed?  Then you know where the box is, what and how many items it contains, how long it has been there and what it has cost you to warehouse it, since you know how many cubic feet are in the warehouse, what the fixed and variable costs for it are and the size of the box.  You just eliminated all the overhead in your company that doesn't actually move the ball forward AND gained valuable insights into cash and inventory flow.
We already told you that everyone runs a company the former way and not the latter.  Many of you are too young to remember Braniff, so we'll use a more recent example (and we'll also discount the financial sector since they are not too bright (how do you have billions in assets and go bankrupt? (the question is rhetorical, I know what happened (and they are still idiots (and I know you love my nested parentheticals)))) and the energy sector since the financial sector took them down in 2008 (…and Enron cooked the books)).  USAir filed one of the largest bankruptcies in history for the same reason Braniff did: they didn't know how much it costs to fly you from Point A to Point B.  Substitute Chrysler. Substitute WorldCom.  We use USAir because what they do is more of a commodity.  So if they don't know how much it costs to fly you where you want to go, and someone else has a sale and can do it cheaper, and the brass at USAir thinks that nobody can do it better, they price their service lower, and less than it can be provided for, and BAM, bankruptcy.
I tell you the USAir story for one reason.  If they had used a system like the one I can build for you, they wouldn't have failed.  They would know that it takes X pounds of kerosene for your particular trip, that the kerosene going into that airframe was bought for Y dollars per pound, that the airframe itself can be amortized over Z years at a cost of W per year, that the pilot makes V dollars per flight, that the attendant makes U dollars per hour, and that the maintenance is Q dollars per month (they don't wait for parts to fail; when parts reach so many hours they replace them (you can't just pull over on the side of the sky)), and on and on ad nauseam.  The point is, it is calculable, and they could have done it and didn't; they depended on YOU, dear reader, to bail them out.  If they knew how to calculate the cost of your trip, they could add whatever profit they deem appropriate and go about their business.  When American or Delta offers a sale that is less than the cost of providing the service, you can laugh up your sleeve and let them. Then you can watch Delta and American file for bankruptcy.
Notice how we switched from USAir to you?  Yes, this all applies to you.  Whether you are a roofer or an accountant or Brian Moynihan (of Bank of America), if you are emailing spreadsheets, or even have separate applications for HR, Accounting, Warehousing, and Project/Process Management, you are missing the boat.  You need one application to rule them all, so that when the product hits the dock, the check gets printed (or ACHed, whichever).  Or when you make a sale, the proceeds go directly into accounting, no typing, and therefore no people required.  People are expensive and they make mistakes.  Software is comparatively free and doesn't need to type.  If you have this kind of application in place you can automate everything but the actual hammering of nails and selling of houses (or building and selling Chryslers (they automate a lot of the production too) or flying people around the country).  …and you can automate a lot of the marketing as well.
So yes, Virginia, there is a Santa Claus.  If nobody else is doing business this way, and you can, you should.  …and put your competitors out of business.  If you go to our sister company's website, http://sentiahealth.com, we will explain how we automated the entire health insurance industry.  That will save the American health insurance consumer about 1/3 of the bill.  Not $300 a month, but $200.  Not $1,200 for your family, but $800.  Collectively, we will save you (all) one trillion dollars.
We have done the same for several companies, and we can do it for yours.

Thursday, August 18, 2016

Never, EVER Use Excel for Anything. Quit It!

Today we are going to take on the darling of the business world: Excel. One of my guys was on assignment to Bank of America last year when they came out with a whole new set of rules and regulations centered around Excel spreadsheets. As a younger guy, he was simply amazed that a company that touts its cutting-edge technology would use something that was designed in the dim memory of history. I informed him that the Sumerians developed spreadsheets about the same time cuneiform writing was invented, somewhere past 5,000 years ago, and for the same reason: to keep track of trade. Sure, Excel does a couple of things clay tablets don't, but that is all making stuff automatic instead of by hand, not a paradigm shift.  So that isn't what is wrong with Excel ( ...or Lotus 1-2-3 or any other spreadsheet application). Let's dig in a little and figure it out.

When you are using Excel, nobody is checking your work. Maybe you are supposed to enter a value, and you enter "yes." Maybe someone else does. The fact is that you can. Forbes says Microsoft's Excel Might Be The Most Dangerous Software On The Planet, and for this reason. It turns out that the Chief Investment Office at JPMorgan needed a new widget to help them evaluate risk in a new venture with credit derivatives. This model was based on copying and pasting data into an Excel spreadsheet. Of course, someone not only made a copy-and-paste error, but there were errors in the formulas as well, and they contributed to the loss of billions of dollars. You'd never make a mistake like that, right? They thought the same thing.


The second reason you should never use Excel, which was inherent in the first vignette, is data typing. Data typing is the art of putting like kinds of data together and enforcing that likeness, so we don't get "yes" in the sex column. In Excel there is no data typing. Years ago I had a nurse out in the world collecting health information (unbeknownst to me) and saving the data in a spreadsheet. When she got an anomalous reading she literally typed in "beyond machine range." One of the things I was trying to do for this company was give an overall view into the health of the population. I'm not sure about you, but I have zero idea how to average "beyond machine range" into a mean total cholesterol measure. While Excel data can be accessed through Microsoft's JET database engine, Excel is not a database, and it will allow you to type in anything. More on the JET engine later.


Third, you are locked into doing everything manually. If you want to update the data, you have to type it in or copy and paste it. Humans don't type well. We know this. So maybe a wisenheimer doesn't type "yes" into the sex column, but maybe there is an "N" there, and what does that mean? Manual entry creates these types of problems.


Fourth, Excel locks you into the single-user nightmare. Let's discount for a minute that only one person at a time can update the spreadsheet. Let's think about one guy having it open read-only and making some kind of dumb decision (billions, remember?) based on old data. You are updating the thing with the NEW data while he is making his decision, and he can't see it. Epic fail. Altus agrees and makes the 'manual jungle' case as well.


Fifth, Excel isn't secure. Here is another little nugget on the JET data engine: the username and password you so diligently set are kept in plain text in the header of your file, and you can do an internet search on how to open the file and find that username and password. The JET data engine uses an ISAM database. There is nothing wrong with the ISAM model, but there are far better things to use. ISAM uses what are basically flat files, though they are indexed. There are several utilities that make ISAMs work like or mimic relational databases, but you already have that in the Relational Database Engines, or Relational Database Management Systems (RDBMS), like SQL Server and Oracle. Examining the differences between ISAM and RDBMS is beyond the scope of this article, but the log file alone is worth switching for. The research is left as an exercise for the student.


Sixth, you don't need files. Nobody ever needs files. There is only one good thing to do with files, and that is dig the information out of them and then delete them. Yes, there is an argument to be made for images and contracts, but if I can buy a car on eBay without a signature, you probably don't need a contract either. If you are an attorney and you NEED to have things signed, copy and paste the contents of the document into a DATABASE (more on this later as well), take a picture of the contract, upload IT to the database and destroy the paper copy. You don't need paper and you don't need files. If you have paper and you have files, you have to know where the particular file or piece of paper is before you can access it. You can't tell externally what is in it. You can't search for it. It is a nightmare, and we curse Microsoft for coming up with SharePoint. SharePoint isn't really a new product either. They took FrontPage and added a document management system to it. You didn't need FrontPage and you do not need documents, so you don't need SharePoint.


Seventh, you can't have relationships in Excel. So maybe on one sheet you have students and on another you have classes. How are you going to associate students to classes and throw in a semester and a grade and all the things you need to run your hypothetical school? The answer, of course, is 'you can't.'  DBPros has a decent article on this.


Eighth, you can't see the history of a record or a sheet in Excel. There is no audit trail.  Maybe I have a mission-critical application and I need to know who has seen a record, what they changed, what the old value was and when the change was made. This is impossible in Excel. Before you Excel gurus say differently, and work for three weeks to make some kludgy macro half-assed work, I can add auditing with a few lines of code and far less work, AND do that to all my databases.


 So I've convinced you that Excel is absolutely evil, but what do you use? You still have a job to do, a department to manage, or a business to run. We've given you several hints above, but we will spell it out now: use a Relational Database. We use SQL Server from Microsoft.  You get these advantages:

  • Strong data typing and data validation. There is no more "beyond machine range," "yes," or 168-inch-tall women. Yes, that happened. (See the sketch after this list.)
  • You can automate processes. Remember Forbes? What they wanted was a report. Reports are a click away.
  • Multi-user. You can type into the same table all day and everyone can see everyone else's data. There are even rules to automatically detect when users are updating the same record at the same time and resolve the conflict, but with a well-designed database, you would never have that problem.
  • Security. We have one application with thousands of users, and no one even knows the rest exist. Doctors see only their patients; patients see only their own records and can communicate with only their own doctor. Yes, it is unhackable. Go try. I'm not going to wait, though, because it can't be done, not even by the NSA.
  • You don't need files. Nobody needs files. Take the information out of them and put it in your database, indexed for speed and searchability, with a built-in reporting capability.
  • Relationships. You can build a report that shows you those students and classes above by dragging and dropping. With a reporting tool, you can even make it look pretty. Now you can have rosters and class schedules with a button click. ...and whatever it is your business does as well.
  • Auditing. There are a couple of ways to go about it in an RDBMS. You can use triggers, but those have their own problems. We here at Sentia use stored procedures to encapsulate chunks of functionality that include the audit trail automatically.
  • Consistent Data. RDBMSs use a transaction log file to maintain consistency. The discussion of the log file is outside the scope of this article, but if something happens to your database or even your database server, you have a log file full of all the changes to the database since the last backup.
  • Reporting. If your data is all in one place and everyone is doing what they need to do and entering their data, you can generate ad hoc reports that look like Excel with a query. If you want graphs and text and that kind of thing, you can use a tool like Crystal Reports or SQL Server Reporting Services (SSRS). At Sentia we use SSRS to deliver gorgeous, color, printable reports to the desktop and across the internet.
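As a small illustration of the data typing point above, here is a sketch of a typed, constrained table; the names and ranges are hypothetical. The database refuses "beyond machine range" and 168-inch-tall women at the door, and every row records who created it and when.

CREATE TABLE dbo.HealthScreening
(
    ScreeningID      INT IDENTITY(1,1) PRIMARY KEY,
    PatientID        INT          NOT NULL,  -- would reference a dbo.Patient table
    TotalCholesterol DECIMAL(5,1) NOT NULL,  -- a number, never "beyond machine range"
    HeightInches     DECIMAL(4,1) NOT NULL CHECK (HeightInches BETWEEN 20 AND 96),
    Sex              CHAR(1)      NOT NULL CHECK (Sex IN ('M', 'F')),  -- never "yes"
    CreatedBy        SYSNAME      NOT NULL DEFAULT SUSER_SNAME(),      -- who entered the row
    CreatedOn        DATETIME2    NOT NULL DEFAULT SYSDATETIME()       -- and when
);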
So yes, Virginia, there is a Santa Claus; it just isn't Excel. Particularly if your data is sensitive, and it is, Excel is a horrible choice. The database is such a better choice that having one designed and built, along with an application to get the data in and out of it and produce the (pretty) reports you want to see, will actually save you money, and not just in the long run. All of the people in your office who have Excel and are using it can generally be automated out of existence. Well, they still exist, but you don't have to write them a paycheck to play with the evil monster that is Excel.  We have completely automated not just one, not even two, but several businesses. We have one guy on our staff who does nothing for one company but click the button once a month to produce the billing report. If you've been following our sister company, Sentia Health, you already know that we have automated the entire health insurance industry, making it just an application and getting rid of its associated costs. This saves the average policy holder about 1/3 of his or her monthly health insurance bill, replacing an average of $150 per month per user with a $10 subscription fee. So that saves you personally $140 per month, saves your family $140 per month times the number of family members, and saves the United States about one trillion dollars per year.

 And we can do that because the insurance companies are still emailing spreadsheets and using manual processes.

We can automate your business as well.  Maybe you should call us before your competition does.

Wednesday, August 17, 2016

Writing Efficient SQL: Set-Based Speed Phreakery

This is a complete copy and paste from an article I use as a teaching tool with my junior developers.  It was originally published February 4, 2010; you can find the original post here.
 
If you have been following along, you have heard me talking about "Set-Based" versus "Procedural" programming and a box of rocks a lot.  Today we are going to give you a better example and show you real performance statistics from it.
 
If you jump in the way-back machine, I was coding an application in 1998, and doing a great job for someone with little to no training and only a year or so of experience.  I convinced the powers that be at my company to send me to a T-SQL class.  For the most part it was the most boring thing I had ever attended, kind of like Spanish I for someone who grew up in Madrid.  After class one evening, I brought a script I had created and asked the instructor to take a look.  It was a train wreck where a badly designed database had collided with a junior developer's code.  She looked it over and advised me to "avoid cursors completely, unless I really, REALLY needed them."  I was off into the set-based world.
 
Many, maybe hundreds of times I have heard a junior developer say "I HAVE to use a loop, that is the only way to do it."  Invariably, they are wrong.  Yes, I do use looping for the odd "single use" piece of code, but almost never in production.
 
Have fun with Kelli's article; it shows several things, and yes, Peter Larsson ("Peso") is probably a genius.  Here it is:
 
SQL is a declarative language, designed to work with sets of data. However, it does support procedural, “row-by-row” constructs in the form of explicit cursors, loops and so on. The use of such constructs often offers the most intuitive route to the desired SQL solution, especially for those with backgrounds as application developers. Unfortunately, the performance of these solutions is often sub-optimal; sometimes very sub-optimal.
Some of the techniques used to achieve fast SQL code, and avoid row-by-row processing can, at first, seem somewhat obscure. This, in turn, raises questions over the maintainability of the code. However, in fact, there are really only a handful of these techniques that need to be dissected and understood in order to open a path towards fast, efficient SQL code. The intent of this article is to investigate a very common “running total” reporting problem, offer some fairly typical “procedural” SQL solutions, and then present much faster “set-based” solutions and dissect the techniques that they use.
The examples in this article are based on the classic “running total” problem, which formed the basis for Phil Factor’s first SQL Speed Phreak Competition. This challenge is not some far-fetched scenario that you will never encounter in reality, but a common reporting task, similar to a report that your boss may ask you to produce at any time.
In my experience, I’ve mostly found that the ease of producing the solution is inversely proportional to its performance and scalability. In other words, there is an easy solution and a fast solution, as well as many solutions in between. One may argue that for a report that runs once a month, clarity and maintainability are as important as speed, and there is some truth is this. However, bear in mind that while you can get away with a simple solution on a table with a few thousand rows, it won’t scale. As the number of rows grows so the performance will degrade, sometimes exponentially.
Furthermore, if you don’t know what’s possible in terms of performance, then you have no basis to judge the effectiveness of your solution. Once you’ve found the fastest solution possible then, if necessary, you can “back out” to a solution that is somewhat slower but more maintainable, in full knowledge of the size of the compromise you’ve made.

The Subscription List Challenge

The Subscription List challenge, as presented in Phil Factor’s Speed Phreak competition, consisted of one table containing a list of subscribers, with the dates that the subscribers joined, and the dates that they cancelled their subscription. The task was to produce a report, by month, listing:
  • The number of new subscriptions
  • The number of subscribers who have left
  • A running total of current subscribers
Just about anything you wished to do was allowed in this challenge. You could use any version of SQL Server, change the indexes on the original table, use temp tables, or even use cursors. The person whose T-SQL solution returned the correct results in the fastest time was the winner.
In a nice twist, Phil Factor reported the standings daily and allowed multiple submissions. This allowed the entrants to fine-tune their solutions, and to help out some of the other contestants. The original solutions were developed against a sample dataset of 10,000 rows. However, the top entries performed so well that they were also tested against a 1 million row dataset. Likewise, all of the solutions in this article were developed against the 10K-row dataset, but tested against both.
The eventual winner was Peso, the SQL Server MVP Peter Larsson, whose solution we will present and dissect, and which returned the required report, running against a 1 million row table, in just over 300 milliseconds. Eight other entrants ran very close, with solutions that ran in under 600 milliseconds.
Speed was all important in this contest; this was a super-charged drag race and there was little point turning up to the starting line in a mini-van. However, for lesser SQL mortals, you could be forgiven for regarding Peso’s winning solution with an equal mix of amazement and confusion. Hopefully, after reading the rest of this article, the solution will be clearer and the techniques accessible to you in solving your own T-SQL problems.

Setting Up

All you need to do to get started is to:
  1. Create a new database in your test or development SQL Server instance.
  2. Download the sample data from http://ask.sqlservercentral.com/questions/92/the-subscription-list-sql-problem.
  3. Run the script found in the download to create and populate the table in your sample database.
Go ahead and take a look at the data by running SELECT * FROM Registrations. Figure 1 shows a sample of the data that we will be using to create the report.
Figure 1: The data found in the Registrations table
Notice that many registrations do not have a DateLeft value. These subscriptions are ongoing and are active past the time period required for the report.

Validation

In order to prove that our solution works, we need to know what the correct answer should look like. In this case, of course, we could simply run Peso’s winning script. However, in practice, we won’t have that script and so we’ll need to do our own investigation of the data, to make sure we fully understand the report’s requirements.
The report must provide a list of the months, from the beginning of the data to the month of the contest (September, 2009), along with the count of new subscribers and cancellations for each month. After September 2009, there are no new registrations, only cancellations. The report must also supply a running total of the number of active subscriptions. The simple calculation for the number of active subscriptions in a given month is:
Number of subscriptions from previous month
+ new subscribers in current month
– cancellations in current month
To get started, run these two queries from Listing 1:



Listing 1: Another look at the data
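The listing itself is just two aggregate queries; a sketch along these lines (using the DateJoined and DateLeft columns from Figure 1, though the original listing may differ slightly) produces the counts shown in Figure 2.

-- New registrations per month
SELECT YEAR(DateJoined) AS theYear, MONTH(DateJoined) AS theMonth, COUNT(*) AS PeopleJoined
FROM   dbo.Registrations
GROUP BY YEAR(DateJoined), MONTH(DateJoined)
ORDER BY theYear, theMonth;

-- Cancellations per month
SELECT YEAR(DateLeft) AS theYear, MONTH(DateLeft) AS theMonth, COUNT(*) AS PeopleLeft
FROM   dbo.Registrations
WHERE  DateLeft IS NOT NULL
GROUP BY YEAR(DateLeft), MONTH(DateLeft)
ORDER BY theYear, theMonth;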
Figure 2 shows the partial results. The first query shows the new registrations for each month, and the second query shows the cancellations. Notice that there are no cancellations until June 2004.
Figure 2: The counts of new registrations and cancellations.
Let’s manually calculate the first few results. The first month, January, 2004, has 167 new subscriptions with no cancellations; therefore, the running total is 167. There are no cancellations until June 2004, so we can just add the new subscriptions for each month through May. In June, we have to subtract one cancellation. Table 1 shows the results for the first few months of 2004, manually calculated, which provide a good representation of the report.
Table 1: The report for 2004
Now we have a means to validate the results returned by our report, we’re ready to start building it. Remember that, just like Peso, we are unlikely to come up with the best solution first; more likely we’ll create a workable solution first and then tweak it to improve performance. Of course, after each tweak of your own solutions, be sure to implement regression testing to make sure that the improved solution still returns the correct results.

Row by Agonizing Row

The reason that so much SQL code found out in the wilds performs poorly is that SQL is nobody’s “first language” (with the possible exception of Joe Celko). Developers tend to master the basic SQL syntax quickly, but then start firing out SQL data retrieval code in a manner strongly influenced by the philosophy and practices of their primary, procedural programming language. When speaking at my local VB.NET group, I found that some of the developers in the audience were shocked that there is another, better way to perform updates than one row at a time.
At its most brutal, the row-by-row solution might look something like the one shown in Listing 2. There is no aggregation going on at all. What we have here is basically an explicit cursor that runs through every single row in the dataset, one at a time, performing the necessary calculations and keeping track of the running totals in a temporary table.

Listing 2: The RBAR solution
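The competition script itself isn't reproduced here, but a row-by-row solution of roughly this shape (a sketch against the same Registrations table, not the actual entry) gives the flavor: an explicit cursor walks every row and bumps counters in a temp table, one subscriber at a time.

-- Monthly totals, built one row at a time
CREATE TABLE #Totals
(
    theMonth     DATETIME PRIMARY KEY,
    PeopleJoined INT NOT NULL DEFAULT 0,
    PeopleLeft   INT NOT NULL DEFAULT 0,
    Subscribers  INT NOT NULL DEFAULT 0
);

DECLARE @Joined DATETIME, @Left DATETIME, @JoinMonth DATETIME, @LeftMonth DATETIME;

DECLARE reg_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT DateJoined, DateLeft FROM dbo.Registrations;

OPEN reg_cursor;
FETCH NEXT FROM reg_cursor INTO @Joined, @Left;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- bump the "joined" count for the month this subscriber joined
    SET @JoinMonth = DATEADD(MONTH, DATEDIFF(MONTH, 0, @Joined), 0);
    IF NOT EXISTS (SELECT 1 FROM #Totals WHERE theMonth = @JoinMonth)
        INSERT INTO #Totals (theMonth) VALUES (@JoinMonth);
    UPDATE #Totals SET PeopleJoined = PeopleJoined + 1 WHERE theMonth = @JoinMonth;

    -- bump the "left" count for the month this subscriber cancelled, if any
    IF @Left IS NOT NULL
    BEGIN
        SET @LeftMonth = DATEADD(MONTH, DATEDIFF(MONTH, 0, @Left), 0);
        IF NOT EXISTS (SELECT 1 FROM #Totals WHERE theMonth = @LeftMonth)
            INSERT INTO #Totals (theMonth) VALUES (@LeftMonth);
        UPDATE #Totals SET PeopleLeft = PeopleLeft + 1 WHERE theMonth = @LeftMonth;
    END;

    FETCH NEXT FROM reg_cursor INTO @Joined, @Left;
END;

CLOSE reg_cursor;
DEALLOCATE reg_cursor;

-- a second ordered pass (not shown) then fills in the Subscribers running total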
The term row-by-agonizing row (RBAR), first coined by Jeff Moden on SQLServerCentral.com, is apt here. This solution, while providing the correct results, runs in 7 seconds on 10,000 rows, on the hardware used to test the solutions in the contest. When applied to the 1 million row dataset, it took 13 minutes!
Imagine a box of marbles sitting in a box on a table. If you needed to move all the marbles to another box on the other side of the room without picking up either box, how would you do it? Would you pick up each marble, one at a time and march it across the room to the second box? Unless you were getting paid by the minute, you would probably scoop them all up and carry them all in one trip. You would avoid the marble-by-marble method, if you wanted to be efficient.

Faster, but Still Iterative

A kinder, though still iterative solution might take the following approach:
  • Create a temporary table (#subscriptions), load it with the DateJoined data from the Registrations table, and then aggregate on this column, by month and year, to get a count of the number of people who joined in each month
  • Create a CTE (cancellations), load it with the DateLeft data from the Registrations table, and then aggregate on this column by month and year to get a count of the number of people who left in each month, in each year. In lieu of a CTE, you could also use another temp table or a derived table.
  • Update the PeopleLeft column in #subscriptions with the number of people who left in each month, from the CTE
  • Use a cursor to loop through each month in the temporary table, calculating the required running total of subscribers
This solution is shown in Listing 3.

Listing 3: The Easy, Iterative Solution
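Again, the original listing isn't reproduced here; a sketch of the approach just described (assuming the same Registrations table, and using its own date arithmetic rather than the original's) looks something like this.

-- 1. Aggregate the joiners by month into a temp table
CREATE TABLE #subscriptions
(
    theMonth     DATETIME PRIMARY KEY,
    PeopleJoined INT NOT NULL,
    PeopleLeft   INT NOT NULL DEFAULT 0,
    Subscribers  INT NOT NULL DEFAULT 0
);

INSERT INTO #subscriptions (theMonth, PeopleJoined)
SELECT DATEADD(MONTH, DATEDIFF(MONTH, 0, DateJoined), 0), COUNT(*)
FROM   dbo.Registrations
GROUP BY DATEADD(MONTH, DATEDIFF(MONTH, 0, DateJoined), 0);

-- 2. Aggregate the cancellations in a CTE and update the temp table
WITH cancellations AS
(
    SELECT DATEADD(MONTH, DATEDIFF(MONTH, 0, DateLeft), 0) AS theMonth,
           COUNT(*) AS PeopleLeft
    FROM   dbo.Registrations
    WHERE  DateLeft IS NOT NULL
    GROUP BY DATEADD(MONTH, DATEDIFF(MONTH, 0, DateLeft), 0)
)
UPDATE s
SET    s.PeopleLeft = c.PeopleLeft
FROM   #subscriptions AS s
JOIN   cancellations  AS c ON c.theMonth = s.theMonth;

-- 3. A cursor walks the months in order, keeping the running total
DECLARE @theMonth DATETIME, @Joined INT, @Left INT, @Running INT;
SET @Running = 0;

DECLARE month_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT theMonth, PeopleJoined, PeopleLeft FROM #subscriptions ORDER BY theMonth;

OPEN month_cursor;
FETCH NEXT FROM month_cursor INTO @theMonth, @Joined, @Left;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @Running = @Running + @Joined - @Left;
    UPDATE #subscriptions SET Subscribers = @Running WHERE theMonth = @theMonth;
    FETCH NEXT FROM month_cursor INTO @theMonth, @Joined, @Left;
END;

CLOSE month_cursor;
DEALLOCATE month_cursor;

SELECT theMonth, PeopleJoined, PeopleLeft, Subscribers
FROM   #subscriptions
ORDER BY theMonth;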
So, instead of looping through every row in the data, we aggregate the data to the month level, and then loop through each possible month in the report performing the calculations. We have a lot of date calculations going on in our INSERT and UPDATE statements. These date calculations are very expensive because multiple functions must operate on each date.
When tested, this solution took 360 milliseconds on 10,000 rows and three seconds on 1 million rows. While still not performing as well as the winning solution, it is a big improvement over the original row-by-row solution.
So, this solution, while still flawed, is at least one that will work adequately on a small report table. The other advantage is that the logic behind it is crystal clear, and it will be an easily maintainable solution. BUT…it doesn’t scale. As the results indicate, as more and more rows are added to the registrations table, the report will take longer and longer to run.
We could spend some time tweaking to improve performance, but let’s move on to an even better way.

The Fastest Solution: DATEDIFF, UNPIVOT, and a “Quirky Update”

The winning script in the competition, script “4e” submitted by Peter Larsson (Peso), is reproduced in Listing 4. As noted earlier, it performs so well that, against 10K rows it is almost un-measurably fast, and even when tested against a dataset of one million rows, it took just milliseconds to run. In other words, Peso’s solution is scalable!

Listing 4 Peso 4e
As you can see, we still have the temporary table and the CTE, but what goes on inside each is quite different. Also, this solution has no need for the dreaded cursor, although it does use a few slightly “unconventional” techniques to arrive at the answer. Let’s break this bad boy apart and see if we can learn something.
The overall strategy behind this solution is to pre-aggregate the data down to the rows required in the report while minimizing the date calculations. By doing so, the query engine makes just one pass through the initial data and works with as little data as possible when calculating the running total.
There are three major sections to the code:
  • Aggregation Process– the goal here is to reduce the working dataset, stored in a temp table called #Stage, to the minimum possible number of rows, so that date calculations and the subsequent running total calculation are inexpensive.
  • Running Total calculation – here, we run an UPDATE statement on a CTE to calculate the running total
  • Producing the report –  making clever use of the OUTPUT clause

The Aggregation Process

The essential goal of this step is the same in Peso’s solution (Listing 4) as it was in the solution seen in Listing 3. We need to build a temp table that contains a row for each month and a count of the new subscriptions and cancellations, upon which to base the running total calculation. However, the way that each solution populates the temp table is very different.
Take a look at the table definition for #Stage table (lines A to B), and you will notice something unusual right away. Whereas in the previous solutions we used a conventional DATETIME type for the theMonth column, Peso uses an integer – more about this later. Go ahead and run the table creation statement (A to B).
The SELECT statement used to populate the table is an aggregate query based on the UNPIVOT operator. Run the SELECT statement without the INSERT (C to I) to see what will actually populate #Stage. Figure 3 shows the partial results. Instead of a date, theMonth contains a number from 1248 to 1316 along with the PeopleJoined and PeopleLeft column totals. The Subscribers column contains zeros at this point. This is the data that will be used in the second part of the solution. Except for having integers in the month, it looks very similar to the spreadsheet we created at the beginning of the article to validate the results.
Figure 3: The data used to populate #Stage
Let’s now break this SELECT down, step-by-step to see how it works.

The Pre-aggregation Step

In Listing 3, in order to arrive at a temp table that contained a row for each month and a count of the new subscriptions and cancellations, we had to make two full passes through the Registrations table, once aggregating on the DateJoined column, and once on the DateLeft, performing some expensive date calculations as we went.
Using a “pre-aggregation” involving the DATEDIFF function, followed by a clever trick using the UNPIVOT operator, to normalize the data, Peso managed to take just one pass through the table and so minimize the impact of date manipulations.
In order to understand the final result set shown in Figure 3, we need to first look at the query that populates the derived table on which the UNPIVOT operator works. Running lines E to F of Listing 4 will return the contents of this derived table, as shown in Figure 4.

Figure 4: The heart of the unpivoted data
It may be hard to recognize the fact at first glance, but this is, in many respects, the critical moment in the script. In it, we make a single pass through the 10K (or 1 million) row base table, and “summarize” everything we need from it, in order to solve the reporting problem, into only 586 (or 1580) rows. All subsequent operations are performed on this limited working data set and so are relatively inexpensive.
If you look at the original data, you will see that a date can fall anywhere within a month. In order to efficiently eliminate the “day,” this query uses the DATEDIFF function to convert all dates to the number of months that have passed since the date represented by 0 (1900-01-01). So, for example, any date within January 2004 becomes 1248 (104*12). Instead of grouping by month and year, or converting all dates to the first of the month, this solution just converted all the dates to an integer. Eventually, as we perform the running total calculation in the second part of the solution, these integer values are converted back to dates, using DATEADD.
The query counts the number of rows grouped by the DateJoined and DateLeft calculated values. Therefore, all subscribers who joined and left at the same time are grouped together, or “connected”, as part of the pre-aggregation.
You may be wondering why the solution leaves the dates as integers in the INSERT statement and converts them back to dates at the end of the process. It would be simple to just apply a DATEADD function to add the months back in the first statement:
DATEADD(M,DATEDIFF(M,0,DateJoined),0)
This would convert all the dates to the first of the month at one time, but Peso chose to convert them back to dates later, during the running totals calculation. By doing so, the DATEADD function to convert back to a date was performed on just 69 values instead of up to 20,000 (the 10,000 rows times two dates for each row, minus the NULLs).
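You can see the month arithmetic for yourself with a couple of one-liners:

SELECT DATEDIFF(MONTH, 0, '20040115');  -- 1248: whole months since 1900-01-01
SELECT DATEADD(MONTH, 1248, 0);         -- 2004-01-01: back to the first of that month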

The UNPIVOT Trick

The next step is to “normalize”, or unpivot, the pre-aggregated data. This is the step that means we can avoid any further reads of the base table, because it gives us all the data we need to calculate the monthly number of subscribers.
Let’s look at the UNPIVOT part of this statement (F to G), by running the query in Listing 5, which temporarily removes the GROUP BY and HAVING clauses and replaces the SELECT list found in the original solution.

Listing 5: The un-aggregated UNPIVOT query
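A sketch of that unpivoting step, with column names taken from the figures (the original listing may differ slightly):

SELECT u.theMonth, u.theCol, u.Registrations
FROM  (
        -- the pre-aggregation: one pass through the base table
        SELECT DATEDIFF(MONTH, 0, DateJoined) AS DateJoined,
               DATEDIFF(MONTH, 0, DateLeft)   AS DateLeft,
               COUNT(*)                       AS Registrations
        FROM   dbo.Registrations
        GROUP BY DATEDIFF(MONTH, 0, DateJoined),
                 DATEDIFF(MONTH, 0, DateLeft)
      ) AS d
UNPIVOT (theMonth FOR theCol IN (DateJoined, DateLeft)) AS u;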
Figure 5 shows the data after the pre-aggregation together with the unpivoted results:
Figure 5: Before and after unpivoting
As you can see, the UNPIVOT operator, introduced with SQL Server 2005, takes column headings and turns them into data. This is the opposite of the PIVOT function that turns data into column headings.
NOTE:
Pre SQL Server 2005, we could have used a derived table with UNION ALL to mimic the UNPIVOT functionality. However, that would have led to two passes on the source table, and not just one.
In this case, the DateJoined and DateLeft column names now become values in the theCol column. The integer values stored in DateJoined and DateLeft become individual entries in a new column, theMonth.  Now, each row contains a number representing the month, a number of registrations, and a value showing whether the data represents a new subscription or a cancellation.
Now we have a list of registrations by month and action (see Figure 6). If you take a closer look at the data after the UNPIVOT is applied, you will see that we have multiple rows for each theMonth and theCol combination. For example, Figure 6 shows two rows for month 1306 and DateLeft. The duplicates exist because the initial aggregation grouped the data on the combination of DateJoined and DateLeft. Since we have split those values apart, there are now duplicates and further aggregation must be done.
Figure 6: The UNPIVOT results

Repivot and Aggregate

Moving out a bit more, let’s look at how we re-pivot the data and perform the final aggregation. The SELECT list (C to D) creates a sum of PeopleJoined and a sum of PeopleLeft grouped by theMonth (G to H). The code to focus on here is shown in Listing 6.

Listing 6: Using CASE to pivot
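Pulling the pieces together, a self-contained sketch of the whole aggregation step looks roughly like this (the staging table plus the CASE-based re-pivot; names follow the article's figures rather than the original script):

CREATE TABLE #Stage
(
    theMonth     INT NOT NULL PRIMARY KEY,
    PeopleJoined INT NOT NULL,
    PeopleLeft   INT NOT NULL,
    Subscribers  INT NOT NULL DEFAULT 0
);

INSERT INTO #Stage (theMonth, PeopleJoined, PeopleLeft)
SELECT u.theMonth,
       SUM(CASE WHEN u.theCol = 'DateJoined' THEN u.Registrations ELSE 0 END),
       SUM(CASE WHEN u.theCol = 'DateLeft'   THEN u.Registrations ELSE 0 END)
FROM  (
        SELECT DATEDIFF(MONTH, 0, DateJoined) AS DateJoined,
               DATEDIFF(MONTH, 0, DateLeft)   AS DateLeft,
               COUNT(*)                       AS Registrations
        FROM   dbo.Registrations
        GROUP BY DATEDIFF(MONTH, 0, DateJoined),
                 DATEDIFF(MONTH, 0, DateLeft)
      ) AS d
UNPIVOT (theMonth FOR theCol IN (DateJoined, DateLeft)) AS u
GROUP BY u.theMonth
HAVING SUM(CASE WHEN u.theCol = 'DateJoined' THEN u.Registrations ELSE 0 END) > 0;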
The CASE function in the SELECT list determines which rows to count for PeopleJoined and which to count for PeopleLeft. If theCol equals DateJoined, then the row belongs in PeopleJoined. If theCol equals DateLeft, then the row belongs to PeopleLeft. For those of you familiar with the PIVOT function, you probably know that using CASE is another way to pivot data. In this case, the query “pivots” the theCol column to create a total for PeopleJoined and PeopleLeft for each month. Peso could have chosen to use PIVOT instead, but he found that the CASE method was slightly faster. Figure 7 shows how the data looks after pivoting with CASE.
Figure 7: After aggregating
Finally, the HAVING clause (H to I) filters out any rows with no new subscriptions. In the 10,000 rows, all months after September 2009 have cancellations, but no new subscriptions, and future months are not part of the report. Since HAVING filters rows out after aggregates are applied, the CASE function in the HAVING clause returns the total number of new subscriptions after grouping on theMonth. If a month contains only cancellations, as in the months after September 2009, the CASE function returns zero, and the HAVING clause filters those rows out.
We now have 69 rows which is also the number of rows (months) that will appear in the report. For each month (still represented by the integer result of DATEDIFF), we know the number of subscribers and the number of cancellations. The only thing missing is the number of subscriptions, which is still set at zero. Peso has managed to get to this point with only one pass through the data by his masterful pre-aggregation, followed by use of UNPIVOT and CASE.
Go ahead and run lines B to I to populate #Stage.

Calculating the Running Total

The next part of the solution calculates the required running total of subscribers via a “quirky” update command operating on a common table expression (CTE) (lines J to O).
Again, let’s walk through it step-by-step.

Use a Variable to help you Count: no Cursors!

So far, the solution has manipulated our data to get it ready for calculating the running total. Instead of using a while loop or a cursor to calculate this total, a variable @Subscribers is defined and set to zero (I to J). Initializing a variable on the same line that defines it is new to SQL Server 2008; if you are using SQL Server 2005, you will have to split the declaration and the initialization into two statements, as sketched below.
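Assuming an integer variable, the two forms look like this (run one or the other, not both):

-- SQL Server 2008 and later: declare and initialize in one statement
DECLARE @Subscribers INT = 0;

-- SQL Server 2005: declare first, then initialize
DECLARE @Subscribers INT;
SET @Subscribers = 0;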

Even though we are not using any kind of loop, behind the scenes SQL Server updates one row at a time. Luckily, the pre-aggregation used before getting to this point has transformed the 10,000 original rows into 69, and the implicit cursor that SQL Server uses is faster than an explicit cursor.

The Ordered CTE

CTEs were introduced with SQL Server 2005. At a minimum, you can use CTEs to replace temp tables and views or to separate out a part of the logic of your query. CTEs also have some special uses, for example, for performing recursive queries.
Inside the CTE, we convert back to “normal” dates, and we do this by loading the data from our staging table into a CTE, using the DATEADD function as we go. Since DATEADD is used just once for this small number of rows, it is not that expensive.
If you run just the query inside the CTE definition (K to L), you will see that now theMonth has been converted back into a date and that the data is sorted. Only the running total of subscribers is missing. Figure 8 shows the partial results.

Figure 8: The CTE results
Normally, you cannot order data inside a CTE. One way around this rule is to use the TOP keyword, which allows you to retrieve a number or percentage of rows. In order to control which rows a query returns when using TOP, you can use ORDER BY to sort the data. In this case, Peso chose to use the largest possible integer, 2147483647, to make sure that all the rows will be returned.
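The trick looks roughly like this (a sketch; the CTE name is illustrative, and the real solution runs its UPDATE against the CTE rather than a SELECT):

WITH Ordered AS
(
    SELECT TOP 2147483647
           DATEADD(MONTH, s.theMonth, 0) AS theMonth,   -- convert the integer back to a date
           s.PeopleJoined, s.PeopleLeft, s.Subscribers
    FROM   #Stage AS s
    ORDER BY s.theMonth
)
SELECT theMonth, PeopleJoined, PeopleLeft, Subscribers
FROM   Ordered;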

The Quirky Update

You can use a CTE in one statement to select, update, delete, or insert data. This is the first time that I have seen an UPDATE statement update the data in the CTE itself (M to N) and not just another table joined to the CTE, though there is nothing stopping you from doing so as long as no rules for updating data are violated.  The statement actually updates the #Stage table, which is the basis for the CTE. The UPDATE statement includes an OUTPUT clause (N to O), which we will discuss shortly.
The SET clause of the UPDATE statement has an interesting formula:
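Based on the description that follows, the formula has this shape (a sketch): the variable on the left, the column in the middle, and the running expression on the right.

SET @Subscribers = Subscribers = @Subscribers + PeopleJoined - PeopleLeft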

For each row of our data, the Subscribers column is set to the current value of the variable @Subscribers and then increased or decreased based on what happened that month. The value stored in Subscribers, once it is calculated, is saved in the variable. The variable is then available for the next row of data.
This style of update is unofficially referred to as a “quirky update”. It refers to the ability to update variables and columns in a table simultaneously and in an ordered fashion. Here, the ordering is guaranteed by the ORDER BY in our CTE. If this technique were applied to a permanent table, the ordering would be established by the table’s clustered index; the update would be done according to the clustered index, but it would fail if the table were partitioned, or if parallelism occurred. In any event, it is very important that the calculation is done in the correct order, and that order is guaranteed in this example.
Since the #Stage table is a heap we won’t have problems with the ordering being affected by a clustered index. We won’t get parallelism either because the data is now only 69 records, and holds much less data than a page. Finally, partitioning is not possible on temporary tables.

Producing the Report: A clever use for OUTPUT

The UPDATE statement populates the Subscribers column of the #Stage table. After the update, if we just select the current contents of #Stage, we have the report.  Instead of doing that, Peso chose to use an OUTPUT clause, another relatively new T-SQL feature, introduced with SQL Server 2005. The OUTPUT clause can be used along with any data manipulation statement. It provides deleted and inserted tables, just like triggers do. The deleted table returns the data before the transaction: it can be data that is deleted, or the data before it is updated. The inserted table returns the data after the transaction: it can be data that is inserted, or data after the update. In this case, the inserted table returns the data as it looks after the update. By using this technique, we avoid accessing the table again to display the data.
Run the last part of the script (I to P) to see that the report works. Figure 9 shows the partial results.
Figure 9: The report

One Small Hole

It was stated as part of the competition rules that it could be assumed that all months had at least one new subscriber.
However, it’s worth noting that if there was a particular month that had no new subscriptions, then that month would not be included in the results. In fact, if a month in the middle of the data had no new subscriptions but had some cancellations, the report will not count the cancellations, and the report will be incorrect. To demonstrate this, change the DateJoined values to 2004-07-01 in the rows where the DateJoined values are within June, 2004 in the Registrations table. Then run the script again. You will see that the report no longer includes the cancellation and all the subsequent values are incorrect.
Fortunately, this problem is easily overcome, via something like an outer join to a sequence or numbers table (see, for example, http://weblogs.sqlteam.com/peterl/archive/2009/11/03/Superfast-sequence-generators.aspx)

Not Quite as Fast, but a Bit More Conventional

Not all queries and scripts require that you take the time to tune them to be the fastest they can possibly be. The performance of a query that runs once a month does not matter as much as that of a query that runs 1,000 times a minute. Of course, as you learn more, you will automatically incorporate better-performing techniques into your code.
In the real world, there is a balance to be struck between maintainability and outright speed. However, once you’ve started to understand the considerations that are involved in getting the code to run as fast as possible, you can then decide to “back out” to what might be a slightly slower but more maintainable solution. Alternatively, you might decide that the time required to get the code as fast as possible is more time than you have to spend.
So, for example, if we had established Peter’s solution (Listing 4) as the fastest, but felt that we were willing to sacrifice some of the speed in return for ease of understanding and maintainability, then we might opt for a compromise solution, such as that shown in Listing 7. It is based on the “faster but still iterative” solution from Listing 3, but borrows Peso’s DATEDIFF technique to reduce the number of required date calculations.

Listing 7: The compromise solution
When this “compromise” solution was tested, it ran in 74 ms against 10,000 rows and in 847 ms against 1 million rows. Still quite a bit slower than Peso’s solution, but not bad! Even though it uses a cursor for the running total calculation, the population of the temp table has been improved, and the performance is acceptable.
The important point to remember is that, really, it is the aggregation process that is the most critical. Once you are working with a small data set, the exact manner in which the running total calculation is performed becomes less of an issue, although it will obviously still have some impact.
Table 2 shows how the four solutions stack up:
Solution                                          10,000 Rows                            1 Million Rows
Row-by-row (Listing 2)                            7 seconds                              13 minutes
Faster but still iterative (Listing 3)            360 milliseconds                       3 seconds
Peso (Listing 4)                                  0 milliseconds (too fast to measure)   300 milliseconds
Not quite as fast, but conventional (Listing 7)   74 milliseconds                        840 milliseconds

Summing Up

The real proof of performance to Peso’s brilliant solution is that, even when Phil Factor tested it against one million rows, the report ran in milliseconds. Here are some of the important points:
  • Avoid row-by-row processing, unless on just a handful of rows
  • Pass through the data as few times as possible, preferably just once
  • Use DATEDIFF to convert the date to an integer if the day of the month is not needed
  • Minimize calculations whenever possible by pre-aggregating to a small number of rows first
  • Use UNPIVOT and CASE to realign the columns
  • Use a variable, instead of a cursor, to calculate running totals
  • The OUTPUT clause can be used to display data during an update instead of running an additional SELECT statement
At the very least, I hope you have learned that there can be many ways to come up with the correct answer, but not all solutions are created equal.
The fastest solution presented in this article has reached perfection in terms of performance and scalability. Is it always worth attempting to achieve perfection? Even the SQL Server query optimizer comes up with a plan that is “good enough” and not necessarily perfect. But some techniques perform so poorly on large datasets that learning how to avoid them will pay tremendous dividends.
