There is an often-repeated saying in data analysis: “Garbage in, garbage out”. It means that if you start with bad data (garbage), you will only get “garbage” results.
Data comes from various sources, and the more sources you combine, the more likely you are to accumulate “garbage” data. When this happens, you need to “clean” the data to make it usable for further analysis. Data cleaning is therefore the first step in any project: it ensures the quality, correctness and consistency of the data so that any analysis performed on it gives accurate results.
What happens if the data is not clean?
- Wrong assumptions
- Misrepresentation of facts
- Bad decisions
- Negative impact on the overall business
In short, data cleaning can be defined as removing duplicate, erroneous, incomplete and inconsistent data before any further analysis is performed or any machine learning model is trained. Though data cleaning requires additional time and effort, it makes later analysis and reporting much easier, helps avoid bad decisions and allows the data to yield useful insights.
When building machine learning models, every data scientist should ensure the data is cleaned properly before training. Training can take hours (and sometimes even days!), and garbage data causes the model to perform poorly, resulting in wrong predictions and wasted time and money.
The major steps in data cleaning are described in detail below:
Removing unwanted data
Once we have our starting dataset, we need to look for unnecessary data. This includes duplicate or irrelevant observations, which are very common when we scrape data from multiple sources. Removing unnecessary data gives us a neat, clear dataset, which makes the analysis more efficient. For example, if we are predicting house prices in Dublin, we can remove the data for every county other than Dublin.
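As a rough illustration of the Dublin example, here is a minimal Python sketch that filters out other counties and drops exact duplicates. The `listings` data, its field names and the `remove_unwanted` helper are all made up for this example; a real project would load the data from its own sources.

```python
# Hypothetical listings scraped from two sources; field names are illustrative.
listings = [
    {"id": 1, "county": "Dublin", "price": 450000},
    {"id": 2, "county": "Cork", "price": 320000},
    {"id": 1, "county": "Dublin", "price": 450000},  # exact duplicate of the first row
    {"id": 3, "county": "Dublin", "price": 515000},
]


def remove_unwanted(rows, county):
    """Keep rows for one county and drop exact duplicates, preserving order."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["county"] != county:
            continue  # irrelevant observation
        key = tuple(sorted(row.items()))  # hashable fingerprint of the whole row
        if key in seen:
            continue  # duplicate observation
        seen.add(key)
        cleaned.append(row)
    return cleaned


dublin_listings = remove_unwanted(listings, "Dublin")
print(dublin_listings)
```

Only rows that match on every field are treated as duplicates here. Whether near-duplicates (for example, the same house listed at two slightly different prices) should also be merged depends on the dataset.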
It is normal to have incorrect data entries in any dataset. These unusual values can violate the assumptions of statistical analysis. If a value stands out, or is impossibly high compared to the other data points, we should either correct the error or remove the value from the dataset.
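One common way to spot values that stand out is the interquartile range (IQR) rule, which flags anything more than 1.5 × IQR outside the middle 50% of the data. This is only one possible rule, chosen here for illustration; the prices and the `split_outliers` helper below are invented for the sketch.

```python
from statistics import quantiles


def split_outliers(values, k=1.5):
    """Split values into (kept, outliers) using the IQR rule."""
    q1, _, q3 = quantiles(values, n=4)  # first, second and third quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


# Hypothetical house prices in thousands of euros; 9500 looks like a typing error.
prices = [420, 450, 455, 470, 480, 495, 510, 9500]
kept, outliers = split_outliers(prices)
print("kept:", kept)
print("outliers:", outliers)
```

A flagged value is not automatically wrong. It is worth checking against the source before deciding whether to correct it or remove it.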
Handling missing data
Missing values are values that were not recorded for some variables in the dataset. They reduce statistical power and the accuracy of the resulting analysis. Normally, rows with missing values can be deleted, or the missing values can be replaced with a summary statistic such as the mean, median or mode of the observed values.
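Both options, deleting and replacing, can be sketched in a few lines of Python. The example below assumes missing entries are represented as `None`; the `bedrooms` column and the helper names are hypothetical.

```python
from statistics import mean, median, mode


def drop_missing(values):
    """Delete missing entries (represented here as None)."""
    return [v for v in values if v is not None]


def fill_missing(values, strategy=median):
    """Replace missing entries with a statistic of the observed values."""
    observed = drop_missing(values)
    if not observed:
        return list(values)  # nothing to compute a fill value from
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


# Hypothetical "number of bedrooms" column with two missing entries.
bedrooms = [3, None, 2, 4, None, 3]
print(drop_missing(bedrooms))
print(fill_missing(bedrooms, median))
print(fill_missing(bedrooms, mode))
print(fill_missing(bedrooms, mean))
```

The median is used as the default because it is less affected by extreme values than the mean. The mode is usually the better choice for categorical columns.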
Uniformity is the friend of scalability - Karl Iagnemma
Since we collect data from different sources, standardization is important: it transforms the data into a common format in order to improve data quality. This reduces operational costs and enhances productivity. Cleanliness and consistency are the two hallmarks of standardized data, which make data sharing easy and help analysts work with quality data and produce reliable results.
Data types should be uniform within a dataset, and type conversion is used to make sure that all values in a column share the same type. The same applies to units: for example, if electricity bills are recorded in euros but gas bills in US dollars, converting all the prices to a single currency (either euros or US dollars) keeps the currency consistent.
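The bills example can be sketched as follows. The sketch assumes each bill arrives as text such as `"USD 60.00"`, and it uses a fixed, made-up exchange rate (`USD_TO_EUR`); a real pipeline would look up the rate for the billing date.

```python
# Assumed, illustrative exchange rate; not a real quoted rate.
USD_TO_EUR = 0.90


def to_euros(raw):
    """Convert text like 'USD 60.00' or 'EUR 85.50' to a float amount in euros."""
    code, amount_text = raw.split()
    amount = float(amount_text.replace(",", ""))  # type conversion: text -> number
    if code == "EUR":
        return round(amount, 2)
    if code == "USD":
        return round(amount * USD_TO_EUR, 2)  # unit conversion: dollars -> euros
    raise ValueError(f"unsupported currency: {code}")


# Hypothetical bills recorded in mixed currencies.
bills = {"electricity": "EUR 85.50", "gas": "USD 100.00"}
bills_in_euros = {name: to_euros(raw) for name, raw in bills.items()}
print(bills_in_euros)
```

Raising an error for an unknown currency code, rather than guessing, keeps unexpected formats from slipping silently into the cleaned data.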
Validating the data
The final step in data cleaning is validating your data. This means checking that it is of high quality, is formatted properly and that there is enough of it. Here we need to make sure that the data is clean and that no missing or inaccurate values remain.
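A validation pass can be as simple as a function that lists every problem it finds. The checks below (a minimum row count, required fields, positive numeric fields) and all names are assumptions picked for illustration; the right checks depend on the dataset.

```python
def validate(rows, required, positive=(), min_rows=1):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, found {len(rows)}")
    for i, row in enumerate(rows):
        for field in required:
            if row.get(field) is None:
                problems.append(f"row {i}: missing {field}")
        for field in positive:
            value = row.get(field)
            if value is None:
                continue  # missing values are reported by the required check
            if not isinstance(value, (int, float)) or value <= 0:
                problems.append(f"row {i}: invalid {field} {value!r}")
    return problems


# Hypothetical cleaned listings; the second row still has two problems.
rows = [
    {"county": "Dublin", "price": 450000},
    {"county": None, "price": -1},
]
for problem in validate(rows, required=("county", "price"), positive=("price",), min_rows=2):
    print(problem)
```

Collecting all the problems, instead of stopping at the first one, makes it easier to see how much cleaning is still needed.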
Tools for Data Cleaning:
The most commonly used data cleaning tools are MS Excel, Python, and Tableau. Data scientists are often said to spend around 60% of their time on data cleaning, and the number of tools that can be used to clean data of any size is increasing rapidly. Some of the best data cleaning tools currently on the market can be found here.
For more information on data cleaning, have a look at the Coursera, DataCamp, and Udemy websites, which offer online courses at affordable prices.
Posted: 09/11/2022 10:38:53