The Power of Clean Data
DATA CLEANING
There is an often-repeated saying in data analysis: “Garbage in, garbage out”. It means that if you start with bad data (garbage), you will only get “garbage” results.
Data comes from various sources, and the more sources you combine, the more likely you are to accumulate ‘garbage’ data. When this happens, you need to “clean” the data to make it usable for further analysis. Data cleaning is therefore the first step in any project: it ensures the quality, correctness and consistency of the data, so that any analysis performed on it gives accurate results.
What happens if the data is not clean?
- Wrong assumptions
- Misrepresentation of facts
- Bad decisions
- Negative impact on the overall business
When building machine learning models, every data scientist should make sure the data is properly cleaned before training. Training can take hours (and sometimes even days!), and garbage data makes the model perform poorly, resulting in wrong predictions and wasted time and money.

The major steps in data cleaning are given below:
- Removing unwanted data
- Handling outliers
- Handling missing data
- Datatype conversion
Uniformity is the friend of scalability - Karl Iagnemma
Since we collect data from different sources, standardization (transforming the data into one common format) is important for improving data quality. It also reduces operational costs and enhances productivity. Cleanliness and consistency are the two major hallmarks of standardized data: they make data sharing easy and give analysts quality data from which to produce better results.
Data types should be uniform within a dataset, and type conversion is how you make sure they all match. For example, if electricity bills are recorded in euros but gas bills in US dollars, converting all the prices to either euros or US dollars maintains a single standard currency.
- Validating the data
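The steps listed above can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation: the sample bills, the fixed exchange rate `USD_TO_EUR`, the function names and the "more than 10 times the median" outlier rule are all assumptions made up for this example. The conversion step runs earlier than in the list because the outlier check needs numeric amounts in a single currency.

```python
from statistics import median

USD_TO_EUR = 0.9  # hypothetical fixed rate, for illustration only

# Hypothetical records merged from two sources.
raw_bills = [
    {"id": 1, "utility": "electricity", "amount": "120.50", "currency": "EUR"},
    {"id": 1, "utility": "electricity", "amount": "120.50", "currency": "EUR"},  # duplicate
    {"id": 2, "utility": "gas", "amount": "80", "currency": "USD"},
    {"id": 3, "utility": "gas", "amount": None, "currency": "USD"},  # missing value
    {"id": 4, "utility": "electricity", "amount": "99999", "currency": "EUR"},  # outlier
    {"id": 5, "utility": "electricity", "amount": "110.00", "currency": "EUR"},
    {"id": 6, "utility": "gas", "amount": "100", "currency": "USD"},
]


def remove_duplicates(rows):
    """Removing unwanted data: drop rows that exactly repeat an earlier row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def to_euros(rows):
    """Datatype conversion: parse amounts as floats, all expressed in euros."""
    converted = []
    for row in rows:
        amount = row["amount"]
        if amount is not None:
            amount = float(amount)
            if row["currency"] == "USD":
                amount = round(amount * USD_TO_EUR, 2)
        converted.append({**row, "amount": amount, "currency": "EUR"})
    return converted


def drop_outliers(rows, max_ratio=10):
    """Handling outliers: drop amounts above max_ratio times the median (a crude rule)."""
    limit = max_ratio * median(r["amount"] for r in rows if r["amount"] is not None)
    return [r for r in rows if r["amount"] is None or r["amount"] <= limit]


def fill_missing(rows):
    """Handling missing data: replace missing amounts with the median of the known ones."""
    fill = median(r["amount"] for r in rows if r["amount"] is not None)
    return [{**r, "amount": fill if r["amount"] is None else r["amount"]} for r in rows]


def validate(rows):
    """Validating the data: raise ValueError if a row breaks the expected format."""
    for r in rows:
        if not isinstance(r["amount"], float) or r["amount"] <= 0:
            raise ValueError(f"invalid amount in row {r['id']}")
        if r["currency"] != "EUR":
            raise ValueError(f"unexpected currency in row {r['id']}")
    return rows


clean_bills = validate(fill_missing(drop_outliers(to_euros(remove_duplicates(raw_bills)))))
for bill in clean_bills:
    print(bill)
```

Whether to drop, cap or keep an outlier, and whether to fill or discard a missing value, depends on the dataset; the median is used here only because it is not distorted by a single extreme value.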
Tools for Data Cleaning:
The most commonly used data cleaning tools are MS Excel, Python, and Tableau. Data scientists are often said to spend about 60% of their time on data cleaning, and the number of tools that can clean data of any size is growing rapidly. Some of the best data cleaning tools currently on the market can be found here. For more information on data cleaning, have a look at the Coursera, DataCamp, and Udemy websites, which offer some of the best online courses at affordable prices.
Posted: 09/11/2022 10:38:53