Data Science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data.
Statistics, Machine Learning, Data Science, or Analytics: whatever you call it, this discipline has been on the rise for the last quarter of a century, primarily owing to growing data collection abilities and an exponential increase in computational power.
The field draws from a pool of engineers, mathematicians, computer scientists, and statisticians, and increasingly demands a multi-faceted approach for successful execution. In fact, no branch of engineering, science, or business in any industry is untouched by analytics. Perhaps you, too, are interested in becoming a data scientist, or already are one.
However, as one journeys through a career in analytics, some truths become evident over time. And while none of them are earth-shattering, they often surprise novices in the field. So, it’s worthwhile to know 11 absolute facts of data science.
Data is never clean
Analytics without real data is a mere collection of hypotheses and theories. Data helps test them and find the one that fits the context of the end use at hand. However, in the real world, data is never clean. Even in organizations that have had well-established data science centers for decades, data isn’t clean. Apart from missing or wrong values, one of the biggest problems is joining multiple datasets into a coherent whole. The join key may not be consistent, or the granularity or format may not be suitable. And it’s not intentional. Enterprise data stores are designed around, and tightly integrated with, the front-end software and the users who generate the data, and are often created independently of one another. The data scientist enters the scene quite late, and is often just a “taker” of data as-is rather than part of the design.
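The join-key problem can be made concrete with a small sketch. The snippet below is only an illustration, assuming two hypothetical tables (`orders` and `customers`) whose `customer_id` values were entered with different padding, casing, and types. The helper names `normalize_key` and `join_report` are invented for this example, and the normalization rules are assumptions that would differ for real data.

```python
def normalize_key(raw):
    """Canonicalize a join key: trim whitespace, upper-case, drop leading zeros."""
    key = str(raw).strip().upper()
    # Keep a lone "0" instead of turning it into an empty string.
    return key.lstrip("0") or "0"


def join_report(left, right, key):
    """Report which normalized keys match and which appear on one side only."""
    left_keys = {normalize_key(row[key]) for row in left}
    right_keys = {normalize_key(row[key]) for row in right}
    return {
        "matched": sorted(left_keys & right_keys),
        "left_only": sorted(left_keys - right_keys),
        "right_only": sorted(right_keys - left_keys),
    }


# Hypothetical data: the same customers, recorded inconsistently by two systems.
orders = [
    {"customer_id": " 0042 ", "amount": 10.0},
    {"customer_id": "a17", "amount": 5.5},
    {"customer_id": "99", "amount": 1.0},
]
customers = [
    {"customer_id": 42, "name": "Ada"},
    {"customer_id": "A17", "name": "Grace"},
    {"customer_id": "7", "name": "Alan"},
]

report = join_report(orders, customers, "customer_id")
print(report)
```

Running a report like this before the actual join shows how many rows would silently drop out, which is usually cheaper than discovering the gap after modeling.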
You will spend most of your time cleaning and preparing data
A corollary to the above is that a large part of your time will be spent just cleaning and processing data for model consumption. This usually annoys people new to the industry. To a brilliant mind bursting with sophisticated machine learning methods, spending three-fourths of the time on data wrangling seems a waste of talent and time. Often this leads to dissatisfaction and lack of attention, and the resulting errors can come back to bite even the fanciest of algorithms. If you cannot do this with equanimity and a focus on the big picture, then perhaps you should aim for research in statistics rather than a career in data science.
The need for a Bayesian approach
Data science is a series of hypothesis tests. You need to have a going-in belief, which you then prove right or wrong in light of observations from data.
The stronger your going-in belief, the more contrary evidence you need to prove that belief wrong.
That, essentially, is the Bayesian approach. However, while proving your hypothesis right through data is important, proving the alternative hypothesis wrong is equally important.
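A small worked example may help show how the strength of a going-in belief (the prior) controls how far the same evidence moves it. The sketch below assumes a Beta-Binomial model, which is one common choice and not something the article prescribes. The prior parameters and the observed counts are invented for illustration.

```python
def posterior_mean(prior_successes, prior_failures, successes, failures):
    """Posterior mean of a rate under a Beta prior and binomial data.

    A Beta(a, b) prior updated with s successes and f failures gives
    a Beta(a + s, b + f) posterior, whose mean is (a + s) / (a + b + s + f).
    """
    a = prior_successes + successes
    b = prior_failures + failures
    return a / (a + b)


# Both analysts believe the success rate is about 0.8 before seeing data,
# but with very different conviction (hypothetical numbers).
weak_prior = (4, 1)      # worth roughly 5 earlier observations
strong_prior = (80, 20)  # worth roughly 100 earlier observations

# Hypothetical evidence that contradicts the belief: 3 successes in 20 trials.
successes, failures = 3, 17

weak_posterior = posterior_mean(*weak_prior, successes, failures)
strong_posterior = posterior_mean(*strong_prior, successes, failures)

print(f"weak prior   -> posterior mean {weak_posterior:.3f}")
print(f"strong prior -> posterior mean {strong_posterior:.3f}")
```

With these numbers the weakly held belief drops from 0.8 to 0.28, while the strongly held one only falls to about 0.69: the same 20 observations are not enough to overturn a conviction backed by the equivalent of 100.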
The alternative to Bayesian thinking is to let your data tell you stories.
This can be dangerous because, sliced and diced some way, data will always tell a story.
However, without an a priori belief, the story may not be true in reality. This is often a case of hindsight bias and poor research.
If you want to find differences between two groups, you can always find some. There are so many human attributes that some will turn out to be different just by chance.
That doesn’t mean that those attributes made someone different from others.
On the other hand, if you have a reasonable hypothesis about what could be causing the difference, you can check whether you are right or not.
In the end, either you explain the results from the model based on your understanding, or you change your understanding.
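The claim that some attributes will differ by chance alone can be checked with a quick simulation. The sketch below is an illustration under stated assumptions: both groups are drawn from the same normal distribution, so any "difference" is noise by construction. The function names, the 200 attributes, the group size of 50, and the cutoff of 2.0 on Welch's t statistic are all choices made for this example.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def count_chance_differences(n_attributes=200, group_size=50,
                             threshold=2.0, seed=0):
    """Count attributes that look 'different' although both groups
    are sampled from the same distribution."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_attributes):
        group_a = [rng.gauss(0, 1) for _ in range(group_size)]
        group_b = [rng.gauss(0, 1) for _ in range(group_size)]
        if abs(welch_t(group_a, group_b)) > threshold:
            flagged += 1
    return flagged


flagged = count_chance_differences()
print(f"{flagged} of 200 attributes look different by chance alone")
```

A cutoff near 2 corresponds roughly to a 5% false-positive rate per attribute, so testing a couple of hundred attributes without a prior hypothesis should be expected to produce a handful of spurious "findings".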
Just because an analytic model is great doesn’t mean it will see the light of day
As fun as data science is, there is more to the world than your analytical model. If you see a third or more of your work getting implemented or used, consider yourself lucky. Regardless of their analytic merit, analytic projects get shelved for various reasons all the time: the data changed, the problem changed, no one is interested in the solution, implementation is too expensive, the benefit is not worth the cost, someone else did it first, or the solution is too advanced for its time. Keep calm and carry on.
The point is not the count of these facts, but the importance of internalizing the realities of the industry we want to be part of. Different companies and industries might sit at different points on the spectrum for each of these facts, but collectively knowing and understanding them will make one a more satisfied, broad-minded, and better data scientist.