For a long time I have noticed that most people have a mistaken notion about data science: they think that by learning Python, R and other tools they can become data scientists. Python, R and SQL are all important in data science, but they can be learnt easily once you start working with them. For a data scientist, learning these tools and languages is really the least challenging part. What really matters is knowledge of and about data. A data scientist should have the insight to see through the raw data.
Let us focus on the real thing. We need to think beyond the tools and start concentrating on developing a relationship with data, because that is the key to becoming a data scientist.
Above all, a data scientist must be able to understand the potential of data: its value, its threshold and its flexibility. Working with data in this way is what we call data processing. In data science, data is both the dish and the ingredient, which means the main goal of data processing is to find the useful data by crunching and filtering the raw data.
The phases of data processing:
1. Gathering of the data. Data in all forms is gathered through various platforms, surveys and media. This data is not validated while it is being collected, which is why it is called raw data.
2. Cleansing of data. The gathered raw data goes through a validation process that eliminates the useless data, so that only the useful data passes through.
3. Modification of data. The validated data is restructured, transformed and, if necessary, merged with other data.
4. Processing phase. This is the central phase, where the data is actually processed and the solution to the problem is found. Machine learning algorithms and methods are used in this phase.
5. Interpretation of data. Data scientists can easily read the final result, but for everyone else it is important to present the useful data in an easily readable and understandable form. Data visualization is used in this phase.
6. Data storage. It is essential to store the data, especially the useful data, so that it can be reused whenever necessary. Storing data used to be a huge concern for businesses, but Hadoop and similar big data technology have largely eased that concern.
This is probably the simplest and shallowest explanation of the phases of data processing.
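To make the six phases a little more concrete, here is a minimal sketch in Python that walks a handful of records through each of them. Everything in it is an assumption made for illustration: the records, the `REGIONS` lookup, the function names and the output file name are all invented, and the "processing" step is a plain average per region standing in for a real machine learning model. It uses only the standard library.

```python
import json
import os
import statistics
import tempfile

# Hypothetical raw survey records; every value here is invented for illustration.
RAW_RECORDS = [
    {"id": 1, "age": "34", "spend": "120.5"},
    {"id": 2, "age": "", "spend": "80.0"},    # missing age
    {"id": 3, "age": "29", "spend": "abc"},   # spend is not a number
    {"id": 4, "age": "41", "spend": "150.0"},
    {"id": 5, "age": "52", "spend": "210.0"},
]

# Hypothetical second data source, used in the modification phase.
REGIONS = {1: "north", 4: "south", 5: "north"}


def gather():
    """Phase 1: collect the raw, unvalidated records."""
    return list(RAW_RECORDS)


def cleanse(records):
    """Phase 2: keep only the records whose fields parse as numbers."""
    clean = []
    for rec in records:
        try:
            clean.append(
                {"id": rec["id"], "age": int(rec["age"]), "spend": float(rec["spend"])}
            )
        except ValueError:
            continue  # drop the record that fails validation
    return clean


def modify(records, regions):
    """Phase 3: merge each record with the region from the second source."""
    return [{**rec, "region": regions.get(rec["id"], "unknown")} for rec in records]


def process(records):
    """Phase 4: a stand-in for modelling, here the average spend per region."""
    by_region = {}
    for rec in records:
        by_region.setdefault(rec["region"], []).append(rec["spend"])
    return {region: statistics.mean(values) for region, values in by_region.items()}


def interpret(summary):
    """Phase 5: render the result as a crude text bar chart."""
    lines = []
    for region, avg in sorted(summary.items()):
        lines.append(f"{region:<8} {'#' * round(avg / 20)} {avg:.2f}")
    return "\n".join(lines)


def store(summary, path):
    """Phase 6: save the result as JSON so it can be reused later."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(summary, fh, indent=2)


if __name__ == "__main__":
    summary = process(modify(cleanse(gather()), REGIONS))
    print(interpret(summary))
    store(summary, os.path.join(tempfile.gettempdir(), "summary.json"))
```

In a real project each function would be far larger, and the tools would change (a database or API for gathering, a trained model for processing, a plotting library for interpretation), but the order of the phases and the hand-off of data from one to the next would look much the same.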