The "bad data" lament
Updated: Mar 29, 2020
Yogi Berra famously said “Baseball is 90% mental. The other half is physical.” And so it is with sales forecasting and other predictive analytic models: 90% data cleansing, the other half math.
“But we have bad data.” The mantra of sales operations.
To which we ask… Who has perfect data?
Messy data. Bad data. Fuzzy data. We call it the “bad data lament,” and, to varying extents, it is eternally a problem. So, what can you do about it? Well, what do you want to do with your data? If you are just using a CRM repository as a deal database, then you can probably ignore the problem and move on. So what if someone did not record an activity last year? Or they didn’t enter an opportunity until it was well developed?
Here’s the rub, though. If you try to build forecasting models on bad data, you will spend almost all of your time (90% is an underestimate in our experience) manually or programmatically cleansing your data. That work stands in the way of building any predictive models. Usually, this is where the forecasting project dies and the data cleansing project begins. The problem is, data cleansing is never done. But that does not have to mean the end of forecasting.
Fuzzy data is still usable. And consistent fuzziness is actually good: the sins of the past will continue in the future, so a model trained on that history has already seen them. We have done our share of data cleansing and have seen all sorts of problems. That’s why we put a multitude of data quality filters in Funnelcast. These filters do a pretty good job of fixing and eliminating bad data so it doesn’t interfere with your models.
So, because we built all these data quality filters into Funnelcast, we have a simple answer to the bad data lament. Fuhgeddaboudit.
Before you start a cleansing project, let Funnelcast work its magic. Run a few backtests and see how good the models are. You might be surprised. We have found that a lot of messy data trumps a small amount of clean data. For example, for one project (with all sorts of messy data problems), the Funnelcast one-year prediction had an average absolute error of under 10% in a longitudinal backtest. (Yes, errors were considerably higher without the data quality filters.)
How do we do that test? Simple. Set the test date back in time. This forces Funnelcast to only use data up to the test date to train the model. Then run a one-year forecast and compare that to the actual results. Measure the absolute error (so negative errors don’t cancel positive errors) and divide by the forecast. This gives you the absolute percent error of the forecast. Then advance the test date by one day and do it again. Keep going for a full year. Here’s what the longitudinal backtest looks like for that project.
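If you want to try the same procedure on your own data, here is a minimal Python sketch of the steps described above. It is not Funnelcast's implementation. The function names, the (date, amount) event format, and the `naive_forecast` stand-in model (which simply assumes the next year repeats the trailing year) are all assumptions made for illustration. Swap in whatever model you are evaluating.

```python
from datetime import date, timedelta


def total_between(events, start, end):
    """Sum the amounts of events dated in the half-open window [start, end)."""
    return sum(amount for day, amount in events if start <= day < end)


def naive_forecast(history, test_date, horizon_days):
    """Hypothetical stand-in model: assume the next window repeats the last one."""
    window_start = test_date - timedelta(days=horizon_days)
    return total_between(history, window_start, test_date)


def longitudinal_backtest(events, first_test_date, steps=365,
                          horizon_days=365, forecast_fn=naive_forecast):
    """Return one absolute percent error per test date."""
    errors = []
    for offset in range(steps):
        test_date = first_test_date + timedelta(days=offset)
        # Train only on what was known before the test date.
        history = [(day, amount) for day, amount in events if day < test_date]
        forecast = forecast_fn(history, test_date, horizon_days)
        horizon_end = test_date + timedelta(days=horizon_days)
        actual = total_between(events, test_date, horizon_end)
        if forecast == 0:
            continue  # percent error is undefined for a zero forecast
        # Absolute error divided by the forecast, as described in the text.
        errors.append(abs(forecast - actual) / forecast)
    return errors


def average_error(errors):
    """Mean of the absolute percent errors across all test dates."""
    return sum(errors) / len(errors)


# Synthetic example (made-up data): one 100-unit deal per day for four years.
events = [(date(2015, 1, 1) + timedelta(days=i), 100) for i in range(4 * 365)]
errors = longitudinal_backtest(events, date(2016, 6, 1))
print(f"average absolute percent error: {average_error(errors):.1%}")
```

The synthetic stream is perfectly steady, so it only demonstrates the mechanics. With real data, make sure your history extends at least one forecast horizon past the last test date, or the "actual" totals for the later test dates will be incomplete.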
And accuracy improves considerably with more events. If your business has 100 events in a forecast period, you can expect smaller percentage errors than if you were expecting only ten. For example, for another project with considerably more data, we measured average absolute errors of 2.5% at one year (forecasting several hundred events).
So, don’t fret the bad data lament. Fuhgeddaboudit, and let Funnelcast try first.