Archive for the ‘Data Quality’ Category

Validating Orders & Revenue

Posted on July 26th, 2010 by Adam Greco  |  4 Comments »

(Estimated Time to Read this Post = 4 Minutes)

I recently received an e-mail from a blog reader who was having issues tying their Orders in SiteCatalyst to Orders in their back-end system.  Here is a snippet from the e-mail:

I have a little issue in my own SiteCatalyst setup that I recently discovered.  Sad for me I had trusted the number of Orders for each day’s Conversion Funnel and recently I decided to validate the numbers in SiteCatalyst against what our back-end system has.  SiteCatalyst is 5%-10% understated each day which makes for a heck of a difference at the end of the month!  I’d rather be understated than overstated, but can you give me some ideas where I should look first?

Unfortunately this is an all too common problem I hear out there.  In this post I am going to share some ideas on how you can tackle this Order/Revenue validation issue head-on and make sure you can trust your critical Orders/Revenue data in SiteCatalyst.

Order ID eVar
If you have an online shopping cart, you should already be setting the s.purchaseID variable with a unique Order ID when an Order takes place on the website.  This variable is used by SiteCatalyst to ensure Order uniqueness.  Unfortunately, the downside of this variable is that it is not readily available in the SiteCatalyst interface.  It is available in DataWarehouse but not in regular SiteCatalyst reports or Discover.  Carmen Sutter (@c_sutter) has submitted  an idea in the Idea Exchange to change this, but until then, I recommend that you set what I call an Order ID eVar variable.  To do this, all you need to do is set the same value you pass to the PurchaseID variable to a custom eVar.  This will allow you to see all Orders and Revenue by Order ID from within SiteCatalyst and Discover as you would any other eVar.  Once you have done this, you can open up this new Offer ID eVar and add your Orders or Revenue Success Event as needed:

In the example above, we can see that most Orders have only one Order ID, which is what we want.  However, in this case, we can see that one ID was counted twice.  That may require some research and I like to schedule a report like the one above to be sent to me weekly so I can make sure nothing strange is going on.

Data Sources Setup
However, while adding an Order ID eVar is helpful in seeing if you are over counting Orders in SiteCatalyst, it won’t tell you if you are under counting Orders or  how close your SiteCatalyst data is to your back-end systems.  To do this, I recommend you use Data Sources.  As a quick refresher, Data Sources allows you to import external data/metrics into SiteCatalyst (see post link for more details).  In this case, I recommend that you import in a file from your back-end system into SiteCatalyst which contains your unique Order ID, the number of Orders (which should always be “1″) and the Revenue Amount.  When you import data via Data Sources, you include the date that you want the data to be associated with so it doesn’t matter if you import the data on a daily, weekly or monthly basis, but the more frequently you upload it, the better so you can find issues quickly.

Here are step-by-step instructions on how to do this:

  • Create the Order ID eVar described above
  • Create two new Incrementer Success Events and name them “Back-End Orders” (Type=Numeric) and “Back-End Revenue” (Type=Currency)
  • Create a new Data Sources upload template (ClientCare or Omniture Consulting can assist with this).  You want to be sure to map the two new “Back-End” Success Events to the Data Sources template.  Even more critical, is that you want to include the newly created Order ID eVar in the Data Sources template.  If you do not do this, then you will not be able to see these two new Back-End metrics in the same Order ID eVar report that you have in SiteCatalyst (more on this later).

  • When you are done, you should have a Data Sources template that looks something like this:

  • Now all you have to do is work with your developers to have this file sent via FTP to the Data Sources FTP on a regular basis.

The Payoff
So by now, you are probably saying to yourself: “That’s a lot of work!”  No argument here!  However, hang with me as I share what the ultimate payoff is for doing this.  As you recall, our primary objective was to see if our online Order and Revenue data was matching what our back-end systems indicated.  Now that we have the Order ID eVar and two new “Back-End” Order and Revenue metrics, we have everything we need.  This is where the fun begins and we put it all together!

All you have to do now is to open the new Order ID eVar report and add all of the relevant metrics.  First, we will add the SiteCatalyst Orders and Revenue so we can see online Orders and Revenue by Order ID:

Next, we will add the two new “Back-End” metrics to the report and, since we were smart enough to include the Order ID eVar value in the Data Sources upload, SiteCatalyst knows which “Back-End” Order ID and dates line up with our online data:

Cool huh!  As far as SiteCatalyst is concerned, these offline metrics are connected to your Order ID eVar values just as if they had happened online.  Using this report, we can see if there are any differences between our online and offline data.  In the example above, it looks like the “Back-End” system had an order with $2,350 in revenue that wasn’t captured online.  Having this information makes it much easier to troubleshoot order submission issues.  You can even use DataWarehouse or Discover (only if you use Transaction ID Data Sources) to break down Order ID by browser, domain, IP address, etc… to see if you can figure out what is happening.  In addition, you can export this data to Excel and look at the totals to see how far off you are in general.

Finally, for the true SiteCatalyst geeks, you can create a Calculated Metric that divides Orders by Back-End Orders and/or Revenue by Back-End Revenue to see a trended % that each is off and set up Alerts to notify you if they deviate too much!  When you take into account this level of assurance all of a sudden the Data Sources work above might not seem like all that much in the long run!

Final Thoughts
If you sell products online, nothing is more critical than believing in your key metrics.  Even if you don’t sell online, the same principles here can be applied to lead generation forms, subscriptions or any other metrics you store in SiteCatalyst and also in your back-end systems.

Adam Greco is the Director of Web Analytics at Salesforce.com.  You can read his previous Inside Omniture SiteCatalyst blog at http://blogs.omniture.com/author/agreco/ and can follow him on Twitter at http://twitter.com/adamgreco.  You can also hear Adam on the BeyondWebAnalytics podcast.  Please send questions and comments to adam@the-omni-man.com.

Please note: I am no longer an employee of Omniture and the content/views expressed here are my own and not those of Omniture.

Data Quality – The Silent Killer…

Posted on December 7th, 2009 by Adam Greco  |  4 Comments »

In this post, I am going to talk about how Data Quality can kill an Omniture (or other Web Analytics) implementation.  I will share some of the problems I have seen and show some ways that you can help improve Data Quality…

Sound Familiar?
So you have been managing an Omniture implementation for a while.  You have your KPI’s lined up.  You have been sharing some dashboards and reports with people throughout your company.  People are starting to realize that they should talk to you before making website business decisions.  Suddenly, you find yourself in the executive suite to answer some key website questions.  Then, just as you are wrapping up your web traffic overview, an executive starts to calculate some numbers on a notepad and determines that the increase you show in Paid Search traffic doesn’t look right given other data they have seen from the SEM team.  She also questions the rise in traffic data for EMEA, knowing that his VP in the region told you traffic has been down over the last few months.  Suddenly, you are in a web analytics death spiral.  In a split second, you have to decide, do you defend your Omniture data and risk your reputation or do you back-pedal saying you will re-check the web analytics data and live to fight another day?

Hopefully this hasn’t happened to you, but it has happened to most of us who have been around the web analytics space for long enough.  Unfortunately, you only get so many chances to be wrong  about data you are presenting and even if your data is right, if you aren’t confident enough to stand by it, it might as well be wrong.

Minimizing Data Quality Risk
So how do you avoid this situation?  The first step is to realize that there is no way to be sure that all of your web analytics data is correct.  100% Data Quality is not only unattainable, but also not worth the time and effort it would take to achieve.  Therefore, I use a philosophy of risk minimization in which I try various techniques to minimize the key things that cause data quality issues.  The following will show you some of the ways to do this:

Ensure all Pages are Tagged
This is easier said than done.  As we all know, IT is usually used to deploy JavaScript tags and they often have more important things to do than to guarantee that every website page has a the [correct] JavaScript tag.  Fostering a good relationship with IT helps, but at the end of the day, new website pages are created all the time, and tags will be missing.

Use Technology
As you can imagine, where there is a need, there are technology vendors.  The main vendors that I have worked with or heard the most about are WASP and ObservePoint.  Not completely coincidental, ObservePoint was founded by John Pestana who was one of the co-founders of Omniture.  In a great blog post, John Pestana talked about getting rid of asterik’s in web analytics reports.  I am sure there are many other vendors out there offering similar products, but the gist of the technology is that it can spider your website and let you know which pages are missing JavaScript tags so eliminate any obvious omissions.

Blood, Sweat & Tears
Unfortunately, the main way that I have minimized web analytic data loss is by downloading data and looking for anomalies.  I normally do this by taking advantage of the Omniture SiteCatalyst Excel Client and downloading key data blocks by day or week and then using formulas to compare yesterday to the same day last week or last week to the week prior.  Once you have the data in Excel, you can do any type of statistical analysis you want on the data to see if anything looks “fishy.”  One thing I like to do is to use Excel conditional formatting to spot data issues.

The following is a screenshot example of using Excel to spot potential data issues.  In this example, I am looking at Page Views from one week prior to each day and if there is a change of more than 20%, I highlight it in red:

dq_excel2

Uh-oh… It looks like our daily data quality report indicates that we may have lost a tag on Friday for the Login page and something suspicious took place related to the Search Results page the same day.  Obviously, the downsides of this approach are that it is extremely manual and that it is in arrears.  As you know, once you miss a time slot of data in SiteCatalyst, there is no easy way to get that data back.  While this approach can minimize the data loss to a day, it won’t help you get the Login Page data back in the example above.

Therefore, the way I employ this approach is to focus on the top items within each variable.  This means, I focus on the pages with the most Page Views, the Form ID’s with the most Form Completions, the Orders for the most popular products, etc…With the Excel Client, you can download multiple data blocks at once and then use conditional formatting to easily spot the issues.  Done intelligently, Data Quality for 80% of your data can be done in under a few hours each day.  By doing this, you can feel more confident when your VP questions your data knowing that if something were significantly off, you would have known about it ahead of time.

Special Cases
I have found that there are a few other situations that commonly lead to missing or bad data so I quickly wanted to bring them to your attention so you can apply some additional effort to ensure they are tagged correctly:

  1. “Lightbox” pages where a new HTML page is often not loaded.  These often times are created as a window within a window and many times developers forget to put SiteCatalyst code within them.
  2. Flash/AJAX pages where the page changes dynamically or you have an entire site/page developed in Flash.  By extra careful around these as they often are missing tracking code (especially when done by an outside agency!).
  3. Dynamically generated content, such as a page that shows historical stock price data after a user enters a ticker symbol.  Often times, these dynamic pages are tagged as one single page, but might be better as unique pages from a web analytics viewpoint.

SiteCatalyst Alerts
If you have read my previous blog post on Alerts, you may figure out that you can use Alerts to help with Data Quality as well.  Alerts can be used to look for changes in key metrics by Month, Week, Day (or Hour in some cases).  These alerts can be handy to be notified when data is off  by more than x%.  However, I have found that if you want to look a more granular data (as in the example above), the current Alert functionality can be a bit limiting.  You can set alerts for specific sProp and eVar values, but not as easily as you can by using Excel.  Therefore, I would use Alerts as an early warning system an employ the previously mentioned techniques as your main defense against missing data.

Classification Data
Finally, when  thinking about data quality/completeness, don’t forget about SAINT Classifications.  If you have key reports that rely on SAINT Classifications, even if you have the source data collected perfectly, if you are missing key SAINT Classifications for that source data, your reports will be incorrect and indistinguishable from poor data quality in the eyes of your users.  You will know if you are missing SAINT Classification data if your classified reports have a large “None” row.  So how do you ensure your SAINT Classification data is complete?  What I do is create Excel data blocks for each Classification and isolate the “None” row for key metrics.

In the screenshot below, you can see that I have created a data block that looks for “Unspecified” Site Locale Full Names (the Excel Client doesn’t use None, but it uses “Unspecified” instead for some reason).  In this scenario, I store a 2-digit website country identifier in an  eVar and use a SAINT Classification to provide a full name.  I filter on “Unspecified”  where the metric is Visits, Form Views and Form Completes.

dq_excel4

After running, you will see a succinct report that looks like this:

dq_excel3

In this case, there are no Form View or Form Complete Success Events missing a Full Site Locale SAINT Classification, but there are some Visits missing the classification.  You can then easily go into SiteCatalyst or Discover, open the Full Locale Name report and break it down by its source to find out what values are left to be classified.

Finally, if you want to earn “extra credit” you can do this for all of your SAINT Classifications in one Excel workbook and make a summary screen like the one below which pulls the percentages that are unclassified into one screen so you can see how you are doing overall.  What is cool about this is that you can use the “Refresh All” feature of the Excel Client to check all of your Classifications while you get coffee and when you get back, you have a fully updated view of your SAINT Classifications.  In the above below, I have shaded some items in black that are OK if they aren’t fully classified, items in green that are acceptable and items in red that require attention:

dq_excel5

Final Thoughts
As you can see, Data Quality is a HUGE topic so it is hard to cover it all in one post, but hopefully some of the pointers here will get you thinking about how you can improve in this area.  One last thing I will mention is that like most things related to web analytics, tools are good, but qualified people are better!  Therefore, I think that any serious web analytics team will have a resource who has Data Quality as one of their primary performance objectives.  Without this, Data Quality tends to fall by the wayside.  Try to do whatever you can to convince your management that having a full or part-time person devoted to Data Quality will pay hefty dividends in the future…

Adam Greco is the Director of Web Analytics at Salesforce.com.  You can read his previous Inside Omniture SiteCatalyst blog at http://blogs.omniture.com/author/agreco/ and can follow him on Twitter at http://twitter.com/adamgreco.  Please send questions and comments to adam@the-omni-man.com.

Please note: I am no longer an employee of Omniture and the content/views expressed here are my own and not those of Omniture.