Wednesday, September 29, 2010

What's in Your Data?

You've heard the old adage "garbage in, garbage out" as it relates to data quality and the quality of subsequent analytics, right?  It is true that numbers that misrepresent reality will not tell an accurate story when they are crunched, but data quality is easier said than done.  And how bad do bad data need to be before they become really bad?  After all, unless there are systemic issues or consistent biases in a dataset and how it is collected, the "averages" will tend to give a pretty good representation of, well, the average - good enough anyway to glean some business insight into a subject of study.  The loser in this scenario, however, is a crisp understanding of the underlying business variability, which oftentimes (unfortunately in this case) ends up being more important for business planning and decision making than the average.  But still, this is no reason to discard a dataset that is suspected to be unclean.

It turns out that there is a whole host of quantitative techniques for identifying and dealing with "bad" data.  No technique will turn garbage into gold, however.  At some point one must adopt a philosophy of sacrificing data quality for the sake of getting down to business and actually doing something with the data.  Far too often an inordinate amount of effort is put into data quality (rightfully so in some cases - like regulatory reporting), with analytics becoming an afterthought.  If data are to be leveraged to make better, smarter decisions - with limited resources - there needs to be a relaxation of the high expectations for data quality.
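To make that concrete, one standard screen (my example, not one the post names) is Tukey's fences: flag any value that falls more than 1.5 interquartile ranges beyond the middle half of the data.  A minimal sketch in Python, with the conventional k=1.5 cutoff as an assumed default:

```python
import statistics

def flag_outliers(values, k=1.5):
    """Flag values outside Tukey's fences: beyond k * IQR of the
    middle 50% of the data. k=1.5 is the conventional default,
    an assumption here rather than a rule from the post."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# A hypothetical sales column with one obvious data-entry error:
sales = [102, 98, 105, 97, 101, 9999, 103]
print(flag_outliers(sales))  # [9999]
```

A flagged value isn't automatically garbage, of course - the point of a screen like this is to direct scarce cleaning effort at the few suspect points rather than the whole dataset.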

We can have it both ways, though, if we reconsider how we capture data: any data that require a human to use judgement, and that additionally require that same (or another) human to input those data into a computer, are right off the bat "bad" (think surveys).  This can be mitigated somewhat by carefully controlled data forms, but the point is that no two humans will look at the same data call identically, and even the same human might look at it differently at different points in time.  A far better way to capture data is through the machine capture of human activity.  This is the "data exhaust" that gets emitted when humans interact with machines - login frequency, clicks, emails, web searches, non-cash purchases, monitoring systems, etc.  Let's admit it: most of our business activity interacts with machines one way or another, and those machines do (or could) capture the who, what, where and when of this activity (it is up to business analytics to figure out the "why").  After all, business analytics at its core is all about trying to understand specific human behaviors that relate to one's business (sales, marketing, operations, etc.).  So why not use a data source that specifically and automatically tracks that, rather than trying to replicate it with subjective surveys?
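As a rough sketch of what that machine capture might look like - the field names, the JSON-lines file format, and the `log_event` helper are all illustrative assumptions on my part, not a prescription - an event logger that records the who/what/where/when automatically, with no human judgement in the loop:

```python
import json
import os
import socket
import time

def log_event(action, detail, path="events.log"):
    """Append one machine-captured who/what/where/when record
    as a JSON line. Everything is captured automatically; only
    the 'why' is left for the analyst."""
    record = {
        "who": os.environ.get("USER", "unknown"),  # the actor
        "what": action,                            # e.g. "login", "click"
        "where": socket.gethostname(),             # the machine involved
        "when": time.time(),                       # unix timestamp
        "detail": detail,                          # action-specific payload
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Hypothetical usage: capture a checkout click as it happens.
evt = log_event("click", {"button": "checkout"})
```

Because every record is stamped the same way by the same code, two events are always directly comparable - exactly the consistency that human-keyed survey data can't promise.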

Whatever the quality of your data, your means of capturing them, or your tolerance for imperfection, think about ways to improve your data, but don't obsess over it.  Instead, understand the shortcomings of what you've got, take advantage of what's good, and get out there and use your data!
