In the movie The Sixth Sense, Cole Sear said “I see dead people”. Author Q. Ethan McCallum, whose excellent book Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work just came out, likely sees bad data just about everywhere.
So just what is this monster called bad data? McCallum writes on page 1 that it is difficult to explicitly define what bad data is. He writes that some people consider it a purely hands-on, technical phenomenon, namely missing values, malformed records, and incompatible file formats, but he also notes that it is much more than that.
Chapter 1 notes that bad data includes data that eats up your time, causes you to stay late at the office, drives you to tear out your hair in frustration and more. It’s data that you can’t access, data that you had and then lost, data that’s not the same today as it was yesterday. Ultimately, bad data is data that gets in the way. And there are so many ways to get there, from bad storage, to poor representation, to misguided policy.
In the book, McCallum gathered numerous authors to detail how bad data issues have affected them and what they have done to deal with and remediate those issues.
Most books that have close to 20 authors suffer from poor organization, repetitive material and an overall lack of structure. This title suffers from none of that, and provides the reader with an excellent guidebook to ensure that they don’t run into the garbage-in, garbage-out scenario when dealing with data. This is particularly important given that we are living in a data-driven society.
While the topic is ostensibly dry, the authors' expertise is such that they are able to make the text most interesting. This is particularly true in chapters 2 (Is It Just Me, or Does This Data Smell Funny?), 8 (How Chemists Make Up Numbers) and 16 (How to Feed and Care for Your Machine-Learning Experts).
Another interesting chapter is chapter 14, on the Myths of Cloud Computing. Steve Francia debunks 4 pervasive cloud myths, including the notions that the cloud is a great solution for all infrastructure components and that the cloud will always save you money.
The beginning of the book has a lot of code that may turn off some non-programmers. After chapter 7, the coding examples are limited, and the message the authors give is definitely worth reading.
While hardware is cheap and bandwidth even cheaper, the book shows that bad data is extremely expensive. Bad data has significant and always negative consequences.
The book takes a highly systematic approach to data quality analysis, which is a most important task. Given the importance of the topic, Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work is a title that has relevance for nearly everyone in IT, and should be read by anyone who is concerned with the integrity of their organization's data.