
Getting Data Science Projects Right: 5 Common Data Mistakes to Avoid

Data. It’s the new oil. The new gold. The new fuel that drives business success. Why? Because today, a successful company has customers at its core – and the more you know about your customers, the better you can understand their needs and evolve your products, services, and marketing strategies to serve them. How do you understand your customers better? You guessed it – with data. Or, more accurately, by using data science tools and techniques to examine large amounts of data, uncovering hidden patterns, correlations, connections, and other insights that let you identify opportunities and make informed, evidence-based decisions.

Yes, the days of one-size-fits-all advertising, competing solely on price, and gut-feeling decision-making are well and truly over. Today, organizations of all sizes are investing in data science to help them improve operational efficiency, optimize marketing campaigns and customer service programs, respond more quickly to emerging market trends, identify opportunities for new products and services, gain a competitive advantage over rivals, and ultimately increase revenue and profit.

When leveraged correctly, a data science project can unleash any or all of these advantages for the business. However, when it comes to data-driven insights and decision-making, accuracy counts above all else. When mistakes are made with data, all the analysis and reporting that follows will be inaccurate or incomplete. And if your reporting and analysis are wrong, your decision-making will be flawed too – and no advantage will be gained whatsoever.

Let’s consider five of the most common data mistakes that plague data science projects.

5 Data Science Mistakes to Avoid

1. Garbage In, Garbage Out (GIGO)

Data is indeed the fuel that drives business success – but only insofar as the fuel you use to power your business engine is clean. It makes no difference how good your data science skills or software programs are if you’re fueling up with bad data that hasn’t been cleaned and rigorously checked for accuracy. Using dirty data to fuel a data science project is like filling a gasoline car with diesel – you think you’re powered up and ready to go, when in fact all you’re doing is causing serious damage to your engine.

To give you an example of how bad data can lead to disaster, consider the Mars Climate Orbiter – a robotic space probe built by Lockheed Martin and launched by NASA in 1998. The mission was to learn more about the planet Mars – however, as the craft approached the planet, bad data led to its thrusters being fired incorrectly. The problem was that one piece of software, supplied by Lockheed Martin, calculated the force the thrusters needed to exert in pound-force, an imperial unit – but a second piece of software, supplied by NASA, read in that data assuming it was in the metric unit, newtons. What happened? The error caused the craft to dip 105 miles closer to Mars than intended, and the probe was destroyed in the planet’s atmosphere. As a result, NASA’s quest to learn more about the red planet was set back by years, and a $327.6 million mission was lost.

Whoops!

But it’s not just space programs that are guilty of fueling engines with dirty data. In fact, it’s quite a common problem. According to recent research, 62% of organizations rely on marketing/prospect data that’s up to 40% inaccurate, 94% suspect their data is erroneous, and 40% of business objectives fail due to imprecise data. The lesson is simple – if you put garbage in, you get garbage out. At the outset of a data science project, be sure to check the accuracy of all data carefully. Screen it and clean it meticulously – otherwise, much like the Mars Climate Orbiter, it won’t be long before you crash and burn.
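What does screening and cleaning look like in practice? Here’s a minimal sketch in Python using pandas – the table, column names, and sanity checks are purely illustrative assumptions, but the pattern (quantify the dirt first, then filter it out) applies to almost any dataset:

```python
import pandas as pd

# A tiny, made-up prospect table -- the columns are hypothetical examples.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "b@example.com", None],
    "annual_revenue_usd": [1_200_000, -50_000, 980_000, 410_000],
})

# 1. Screen: quantify how dirty the data is before analyzing anything.
print(df.isna().sum())                      # missing values per column
print(df.duplicated(subset="email").sum())  # duplicate records

# 2. Clean: drop rows that fail basic sanity checks.
clean = (
    df.dropna(subset=["email"])             # no contact, no prospect
      .drop_duplicates(subset="email")      # one row per prospect
      .query("annual_revenue_usd >= 0")     # revenue can't be negative
)
print(clean)
```

The same principle applies to units: normalizing every thrust figure to a single unit (1 pound-force ≈ 4.448 newtons) and asserting it at the system boundary is exactly the kind of check that would have surfaced the Orbiter’s mismatch before launch.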

2. Not Defining a Clear Research Goal at the Outset

Every data science project needs to be acutely focused on answering a very specific business question. You can collect all the data in the world, but when the time comes for meaningful processing and analysis, if you’ve lost sight of the original question you were trying to answer, any “results” you end up with will likely not tell you anything useful at all. Instead, you need a clear research goal – one that, importantly, will lead to advantageous changes in business operations. And it needs to be defined clearly at the outset.

This comes down to asking the right question. The right questions are those that Brandon Rohrer, Principal Data Scientist at iRobot (formerly at Microsoft), defines as “sharp”. In a blog post called “How to Do Data Science”, Rohrer writes:

“When choosing your question, imagine that you are approaching an oracle that can tell you anything in the universe, as long as the answer is a number or a name. It’s a mischievous oracle, and its answer will be as vague and confusing as it can get away with. You want to pin it down with a question so airtight that the oracle can’t help but tell you what you want to know. Examples of poor questions are ‘What can my data tell me about my business?’, ‘What should I do?’ or ‘How can I increase my profits?’ These leave wiggle room for useless answers. In contrast, clear answers to questions like ‘How many Model Q Gizmos will I sell in Montreal during the third quarter?’ or ‘Which car in my fleet is going to fail first?’ are impossible to avoid.”

Learn more about data science goal-setting in our previous post, “The Importance of Defining a Research Goal in a Data Science Project.”

3. Cherry Picking

Cherry picking is the practice of selecting only the results that confirm a particular claim, expectation, or position, and discarding all the data that does not. In other words, it’s choosing the data that “fits” a hypothesis or argument instead of reporting all the findings.

Cherry picking destroys the credibility of any data science findings for the simple reason that it’s dishonest. If all you’re focusing on are the results that confirm your viewpoint, you’re missing the true story the data is telling. In an article entitled “Why Data Driven Decision Making Is Your Path to Business Success” for Datapine, Sandra Durcevic notes that it can be all too easy to see only the data we want rather than the big picture. When we cherry pick the top insights and ignore or hide information that doesn’t support our viewpoint, we end up with confirmation bias – and bad business decisions built on top of it.
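To see how dramatically cherry picking can distort a result, consider this toy simulation in Python (the scenario and numbers are invented for illustration): a new design that actually performs no better than the old one looks like a clear winner once the unfavorable runs are quietly dropped.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical A/B test: the true average lift of the new design is zero.
conversion_lift = rng.normal(loc=0.0, scale=1.0, size=200)
print(f"All 200 results:      mean lift = {conversion_lift.mean():+.2f}")  # ~0.00

# Cherry picking: report only the runs that favor the hypothesis.
favorable = conversion_lift[conversion_lift > 0]
print(f"Cherry-picked subset: mean lift = {favorable.mean():+.2f}")        # ~+0.80
```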

4. False Causality

As the saying goes, correlation does not imply causation. False causality in data science is when we erroneously assume that when two events occur together, one must have caused the other. For example, let’s say the marketing department launches a new Twitter campaign. In the days that follow, a sharp uptick in organic traffic to your business’s website is observed. A knee-jerk conclusion might be that the Twitter campaign is driving this traffic to your site – i.e. one action directly caused the other.

But how accurate is this conclusion in reality? Were there any other factors at play? For example, is there another campaign the marketing team forgot to mention that might actually be responsible for the traffic increase? What about seasonality, or any other variable? Just because there’s correlation does not mean there is causation, as the charts in Tyler Vigen’s Spurious Correlations project demonstrate:

(Image source: tylervigen.com)
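A hidden third variable – often simply time – is a classic source of false causality. This minimal sketch with synthetic data (both series are invented and share nothing but a growth trend) shows how two unrelated metrics can look almost perfectly correlated until the trend is removed:

```python
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(60)

# Two hypothetical metrics with no causal link -- both simply grow over time.
twitter_followers = 1_000 + 50 * months + rng.normal(0, 100, 60)
website_traffic = 20_000 + 800 * months + rng.normal(0, 2_000, 60)

# The raw correlation looks impressive...
print(np.corrcoef(twitter_followers, website_traffic)[0, 1])  # ~0.99

# ...but removing the shared time trend from each series tells the real story.
resid_f = twitter_followers - np.polyval(np.polyfit(months, twitter_followers, 1), months)
resid_t = website_traffic - np.polyval(np.polyfit(months, website_traffic, 1), months)
print(np.corrcoef(resid_f, resid_t)[0, 1])  # ~0: no relationship beyond the trend
```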

5. Relying on Summary Metrics

Summary metrics are single aggregate figures – averages, totals, medians – computed across an entire dataset. Averages are important, of course, but they often don’t reveal the whole picture. Summary metrics can be misleading because there may be a lot of variation within a dataset that a single summary number simply doesn’t reveal.

For example, you might run a Google Ad that receives an average of 1,000 clicks a week for a month. Great – but what that summary doesn’t tell you is that 75% of those clicks arrived over one weekend in the middle of the month, with the remaining 25% spread very thinly across the rest of it. The only way to understand the true response to the ad is to drill down into the metrics on a day-by-day basis and build an accurate picture of audience engagement and behavior.
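Here’s that exact scenario as a small Python sketch (the dates and click counts are made up to match the example): the weekly average looks perfectly steady, while the daily breakdown tells a very different story.

```python
import pandas as pd

# Hypothetical daily click counts for a month-long ad -- numbers are illustrative.
days = pd.date_range("2024-06-01", periods=28, freq="D")
daily = pd.Series(38, index=days)   # a thin trickle on ordinary days
daily.iloc[14:16] = 1_500           # one huge mid-month weekend

print(f"Weekly average: {daily.sum() / 4:,.0f} clicks")  # ~1,000 -- looks steady
print(f"Clicks from the spike weekend: {daily.iloc[14:16].sum() / daily.sum():.0%}")
print(daily.sort_values(ascending=False).head(4))        # the days the average hides
```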

Final Thoughts (and a Bonus 3 Data Mistakes to Avoid)

Those are the five most common data mistakes that researchers often make while conducting data science projects. There are others, of course – so here’s a bonus of three further mistakes to avoid:

  1. Overfitting: A modeling error that occurs when a model is made so complex that it fits the noise in your existing data rather than the underlying pattern. Such models work well on the data you already have, but they break down when you add more data, and therefore struggle to predict future patterns and trends (see the sketch after this list).
  2. Data Fishing: Also known as data dredging, data fishing is the misuse of data analysis in an effort to prove that statistically significant patterns exist in data when in fact none do. It often occurs when researchers are not sure exactly what they are looking for, and are subsequently misled by the apparent correlations they discover. It can easily be avoided by defining a hypothesis upfront.
  3. Sampling Bias: A bias in which a sample is collected in such a way that some members of the intended population are less likely to be included than others. If the bias is not accounted for, it can result in conclusions being drawn from a data set that are not representative of the population you’re trying to understand.
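To make overfitting concrete, here’s a minimal sketch using scikit-learn and synthetic data (the dataset, seed, and polynomial degrees are arbitrary choices for illustration): a wildly flexible model scores beautifully on the data it was trained on and falls apart on data it hasn’t seen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Synthetic data: a simple linear trend plus noise.
X = rng.uniform(0, 1, 30).reshape(-1, 1)
y = 2 * X.ravel() + rng.normal(0, 0.2, size=30)
X_train, y_train = X[:20], y[:20]   # the data we "already have"
X_test, y_test = X[20:], y[20:]     # the data that arrives later

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree:>2}: "
          f"train R^2 = {r2_score(y_train, model.predict(X_train)):.2f}, "
          f"test R^2 = {r2_score(y_test, model.predict(X_test)):.2f}")
# The degree-15 model fits the training set almost perfectly, but its test
# score is far worse -- it has memorized the noise, not the trend.
```

The simpler model with the worse training score is usually the one that actually generalizes.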

Keeping all of these common data mistakes in mind will help you avoid the pitfalls of many failed data science projects.