Enterprises big and small are leveraging AI not only to make better business decisions but also to predict what will satisfy their customers.
Why is it, then, that some companies aren’t able to see results as good as their competitors’ despite using the same algorithms? The answer lies in the quality of data.
One can have the most sophisticated algorithms and the most capable data scientists, but unless the data being processed is clean and of good quality, there is little that artificial intelligence can do to improve business decisions.
Unfortunately, data cleansing is not as simple as it sounds. IndiaAi reached out to two experts who use AI extensively in their businesses to understand how they carry out data cleansing in their organisations.
Raghav Ghaiee, Founder, Queueme Technologies
Ghaiee, a data scientist who has previously worked for companies like Adobe and Apigee (which has since merged into Google Cloud), says there are multiple challenges with data cleansing, including practical ones like screen size. “The cleansing techniques can become very iterative and time consuming. Personally for me, and I am sure others will resonate with this as well, it is viewing datasets on laptop screens and having to wait for the big data query jobs to complete and give out outputs from big data tools.”
Ghaiee, now the founder of Queueme Technologies, a multi-product SaaS startup revolving around scheduling, queueing and data science, suggests the following steps for data cleansing in AI.
- Commas in a data field: Most data science is done on tabular data, i.e. rows and columns, and many times the input file is in CSV (comma-separated values) format, where the field separator is a comma (","). “In such a case, we need to be very sure about the number of columns contained in each row and that no field contains commas. A single comma in a data field itself can break the whole analysis,” Ghaiee says. He explains with an example: “If the data contains salary information and the salary field itself has commas in it, then all the analysis will be incorrect as the salary field will be broken into separate columns due to the comma.” A field-count check catches this early (see the first sketch after this list).
- Numeric data and boolean data in quotes: Data scientists need to make sure that numeric and boolean data is represented without quotes. “Being aware of these issues in the data and making the necessary type conversions can save a lot of time. For example, "12.5" can be converted to the numeric 12.5, but "12.5.1" is simply not convertible to a numeric type,” he remarks. “I would actually also want to see how many have been converted in a valid way. If a lot of data contains non-convertible inputs, it's better to stop the analysis and ask questions about the data source and its quality.” (A conversion-count sketch follows the list.)
- Plotting and visualising data: Many believe this makes the data come to life. “By using plots like scatter plots and bar diagrams we can be more confident about the nature of the data. Very weird-looking patterns can indicate something fishy.”
- “For example, if we are analysing the data of a country which contains age and a ‘has diabetes’ field, and the plots show counterintuitive correlations between the ‘has diabetes’ field and the age field, then something is wrong in the data or in the sampling technique used.” (A plotting sketch follows the list.)
- Missing values: If the data has missing values or "error" values, then either delete these data points or estimate the missing values using algorithms like Expectation Maximization (see the imputation sketch after this list).
- Computing summary statistics of numeric fields: Computing simple statistics like counts, average, median, minimum and maximum of the numeric data fields covers a lot of sanity checks. Moreover, computing these summary statistics separately for various categorical fields with a “group by” operation can make one more confident about the cleanliness of those fields (see the final sketch below).
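A minimal sketch of the field-count check for unquoted commas, in Python; the file name and the three-column schema are assumptions for illustration:

```python
import csv

# Hypothetical input: employees.csv with columns name, department, salary
with open("employees.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for line_no, row in enumerate(reader, start=2):
        # A row with extra fields usually means an unquoted comma
        # inside a value, e.g. a salary written as 1,20,000
        if len(row) != len(header):
            print(f"line {line_no}: expected {len(header)} fields, "
                  f"got {len(row)}: {row}")
```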
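One way to count how many values convert validly, sketched with pandas; the sample values are made up to mirror Ghaiee's "12.5" versus "12.5.1" example:

```python
import pandas as pd

# Illustrative quoted inputs; "12.5.1" and "n/a" cannot become numbers
raw = pd.Series(["12.5", "7", "12.5.1", "n/a", "3.0"])

# errors="coerce" turns non-convertible values into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

valid = numeric.notna().sum()
print(f"{valid} of {len(raw)} values converted cleanly")
# A low valid fraction is the cue to stop and question the data source
```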
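A sketch of the diabetes-by-age plot from Ghaiee's example; the file name and column names are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("health_survey.csv")  # assumed columns: age, has_diabetes (0/1)

# Diabetes rate per ten-year age band; prevalence that falls with age
# would be the kind of counterintuitive pattern worth investigating
df["age_band"] = pd.cut(df["age"], bins=range(0, 101, 10))
rate = df.groupby("age_band", observed=True)["has_diabetes"].mean()

rate.plot(kind="bar")
plt.ylabel("diabetes rate")
plt.tight_layout()
plt.show()
```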
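Both options for missing values in one sketch. Expectation Maximization proper is beyond a short snippet, so scikit-learn's IterativeImputer stands in here as a practical iterative estimator; the array is made up:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up (age, salary) rows with gaps
X = np.array([[25.0, 50000.0],
              [32.0, np.nan],
              [np.nan, 61000.0],
              [41.0, 75000.0]])

# Option 1: delete the incomplete data points
X_dropped = X[~np.isnan(X).any(axis=1)]

# Option 2: estimate the gaps from the other fields
X_imputed = IterativeImputer(random_state=0).fit_transform(X)
print(X_imputed)
```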
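Finally, the summary statistics and the per-category “group by”, sketched in pandas with an assumed file and column names:

```python
import pandas as pd

df = pd.read_csv("salaries.csv")  # assumed columns: department, salary

# Overall sanity check: a negative minimum or an absurd maximum
# in a salary field stands out immediately
print(df["salary"].agg(["count", "mean", "median", "min", "max"]))

# The same statistics per categorical field, via a group-by
print(df.groupby("department")["salary"]
        .agg(["count", "mean", "median", "min", "max"]))
```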
Anish Ravindranathan, Security Architect, Tata Digital
Ravindranathan, a veteran of the cybersecurity industry, uses machine learning to sift through millions of data points and detect any anomaly that could indicate a malware attack.
“The main challenge is to get a 100% fill rate while collecting data; without it there will be a lot of missing data, which means you will not get a rich experience while analysing the data,” Ravindranathan says. “Cleansing is the most important and complex part of data science for your algorithm to work accurately,” he adds. “If you have multiple sources of real-world data, then data cleansing should be a continuous process with some algorithm in place.”
Ravindranathan follows these steps for data cleansing to ensure he gets the best results from artificial intelligence and machine learning.
- Data Correction: It is difficult to get a data set that has been collected accurately and in the correct format, which makes this the most important step. Missing data or data in an incorrect format needs to be normalised. “The most common errors are found in dates, text, numbers etc. We need to also ensure there are no special characters,” says Ravindranathan. (A normalisation sketch follows the list.)
- Data Parsing: “Identify the data fields from different sources and make unique fields required for your analytical purpose. Now collect, import and merge the data into the unique fields identified,” he says (see the merging sketch below).
- Data Duplication: The chances of data duplication are high if data is collected from multiple sources. “So in this step you have to identify duplicate values and then make a set of duplicate values. For example, 10 email addresses with abc@abc.com and their associated field values will make one set,” he says. “The next step is to determine the data-rich set and merge the data. Finally, remove the duplicate values from each duplicate data set.” (A deduplication sketch follows the list.)
- Data Export: Once you have clean data, the next step is to consolidate and export the data to the database (a short export sketch follows the list).
- Data Validation: The final step is to validate that the cleansing process is working as expected. “We need to define a standard for each field and run a query on each of these to check whether it matches the defined standards.” Finally, run a sample report to see whether you are getting the desired output (a validation sketch closes the list below).
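A minimal normalisation sketch for the correction step, in pandas; the column names, the expected ISO date format and the character whitelist are assumptions:

```python
import pandas as pd

# Made-up records with a malformed date and special characters
df = pd.DataFrame({
    "joined": ["2021-03-01", "2021-04-15", "bad-date"],
    "name": ["Asha  ", "R@vi!", "Maya"],
})

# Dates must match the expected ISO format; anything else becomes NaT
df["joined"] = pd.to_datetime(df["joined"], format="%Y-%m-%d", errors="coerce")

# Strip special characters and surrounding whitespace from text fields
df["name"] = df["name"].str.replace(r"[^A-Za-z0-9 ]", "", regex=True).str.strip()
print(df)
```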
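For the parsing step, a sketch of mapping differently named source fields onto one agreed set of unique fields before merging; the two feeds are hypothetical:

```python
import pandas as pd

# Hypothetical sources naming the same concepts differently
crm = pd.DataFrame({"email_id": ["a@x.com"], "full_name": ["Asha"]})
web = pd.DataFrame({"email": ["b@y.com"], "name": ["Ravi"]})

# Map each source onto the agreed unique fields, then merge
crm = crm.rename(columns={"email_id": "email", "full_name": "name"})
combined = pd.concat([crm, web], ignore_index=True)
print(combined)
```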
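One way to merge each duplicate set into a single data-rich record, sketched in pandas with made-up contacts:

```python
import pandas as pd

# Made-up contact records from multiple sources; the two
# abc@abc.com rows form one duplicate set
df = pd.DataFrame({
    "email": ["abc@abc.com", "abc@abc.com", "xyz@xyz.com"],
    "phone": [None, "98765", "12345"],
    "city":  ["Pune", None, "Delhi"],
})

# GroupBy.first() takes the first non-null value per field, merging
# each duplicate set into one record and dropping the duplicates
merged = df.groupby("email", as_index=False).first()
print(merged)
```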
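A consolidation-and-export sketch using Python's built-in sqlite3; the database and table names are placeholders for whatever store the organisation uses:

```python
import sqlite3
import pandas as pd

clean = pd.DataFrame({"email": ["abc@abc.com"], "city": ["Pune"]})

# Export the consolidated, cleaned data to a database table
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("contacts", conn, if_exists="replace", index=False)
```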
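And a per-field standards check for the validation step; the email pattern and age range are illustrative assumptions, not Ravindranathan's actual standards:

```python
import pandas as pd

df = pd.DataFrame({"email": ["abc@abc.com", "not-an-email"], "age": [34, -2]})

# Hypothetical standards for each field, expressed as boolean checks
standards = {
    "email": df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "age": df["age"].between(0, 120),
}

for field, ok in standards.items():
    bad = int((~ok).sum())
    print(f"{field}: {bad} value(s) outside the defined standard")
```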
Data scientists and AI experts say the data cleansing process varies between industries, and no standard has been defined yet. “There is no standard process, but in a nutshell it is all about correcting and standardising the data to get a rich analytical report which enables the business to make appropriate decisions,” says Ravindranathan. Nevertheless, the above guide holds true for more or less all industries, say experts.