Companies have been accumulating data and customer information for years, but extracting value from large amounts of data remains the greatest challenge. Artificial Intelligence (AI) and, specifically, Machine Learning (ML) allow you to get the most out of your data and make the data work for you, producing profits. But the ideal starting point is almost never available: realistic, correct, and properly labeled data in the right quantity.
Incomplete data, small amounts of data on specific examples, missing or unreliable labeling, and data errors are the most common issues hidden in data processing. Data shortcomings can cause a project to fail: if you accept substandard data, the outcome will inevitably be substandard as well, regardless of the development team's brilliance or the algorithms used. Today we are sharing some practical cases of dealing with data issues in machine learning projects.
Incomplete data
The common pitfall of incomplete data arises when an otherwise acceptable amount of data does not fully describe the problem because the covered subject area is too narrow. There are two solutions:
- Starting a new iteration of the development in order to improve the ML system.
- Providing all expected input data at the beginning of the project. This is more of a preventative measure and therefore the more preferable solution.
This problem and its solution are perfectly illustrated by the following project. The system we developed predicts vehicles’ routes in a specific area. It helps dispatchers understand the objects’ behavior, analyze the situation better, and gain more time to react to accidents. The goal was to decrease the number of accidents, conserve the company’s resources, and protect the environment (fewer accidents mean less harm to nature).
The problem emerged after the project was finished. During development, the client provided only data from one specific geographic area.
After using the system in that area, the customer decided to apply it in a completely different location, and you can probably see the problem already. Even though our system delivered better results than the old one, they were still worse than desired. And that was expected, since the system wasn’t designed to adapt to different locations.
In our case, we had to launch a separate project to update the system so that it matched the project requirements.
This could have been avoided by providing all of the available input data at the start of the project.
A small amount of data on specific examples
Even with a large number of available data samples, it is almost impossible to build a quality ML model if little is known about each sample.
The general solution is the same – get more data! The more the better.
Sometimes the problem can be solved by splitting the initial task into several smaller subtasks. But if that’s not an option, shifting the focus to another area and changing the project’s goal can work.
We encountered this issue during a project for Wholesale Flights, an American boutique travel agency specializing in business and first-class airfare.
The company’s ticket-purchasing model was clear and straightforward. After browsing the website, a potential client leaves a request. A travel manager calls them back, picks the most suitable flight options, offers related services, and seals the deal.
The business had been performing extremely well, and the sales system worked effectively until a problem arose. With the growth of the client base, managers could not keep up with the number of requests and process them on time. Conversions stopped increasing and the company risked losing customer loyalty and trust.
The initial goal was to find the requests whose clients were most likely to buy tickets. This turned out to be unrealistic: the available data contained only the request time, the countries and cities of departure and arrival, and the flight type and class. There was absolutely no information about the clients themselves.
So, we suggested turning our attention to another area: predicting which customers were most likely to immediately pick up the phone.
Using this criterion we ranked the leads.
The system, based on Microsoft Azure and cloud computing capabilities, put the promising requests on top of the queue and assigned them to the managers most likely to close the deal.
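To make the idea concrete, here is a minimal Python sketch of ranking leads by a predicted answer probability and putting the most promising on top of the queue. The features and logistic weights below are purely illustrative stand-ins, not the actual model we built on Azure.

```python
import math

# Hypothetical feature weights, e.g. learned from historical call outcomes.
# The feature names and values are illustrative only.
WEIGHTS = {"bias": -1.0, "requested_recently": 2.0, "business_hours": 1.5}

def answer_probability(request):
    """Estimate the probability that the client picks up the phone."""
    z = WEIGHTS["bias"]
    z += WEIGHTS["requested_recently"] * request["requested_recently"]
    z += WEIGHTS["business_hours"] * request["business_hours"]
    return 1.0 / (1.0 + math.exp(-z))  # logistic link

def rank_leads(requests):
    """Sort requests so the most promising ones come first."""
    return sorted(requests, key=answer_probability, reverse=True)

leads = [
    {"id": 1, "requested_recently": 0, "business_hours": 0},
    {"id": 2, "requested_recently": 1, "business_hours": 1},
    {"id": 3, "requested_recently": 1, "business_hours": 0},
]
queue = rank_leads(leads)
print([lead["id"] for lead in queue])  # most promising first
```

In the real system this score also fed the assignment of each request to the manager most likely to close the deal.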
In only half a year, this solution helped cut lead processing time to a third, and sales went up by 17%, three times more than the client’s expectations.
Missing or unreliable labeling
Machine Learning is about teaching the algorithms by “feeding” them the correct answers. So, data must be labeled – sort of marked with “this is correct”.
Businesses tend to shift the task of generating labels entirely to their tech partners. But it is ultimately much less costly to allocate in-house experts and solve the problem jointly with the developer. The same approach works best when dealing with labeling errors.
Here is an example of this problem and its consequences, which we faced in a project for a medical equipment supplier that sells a solution for carotid artery screening.
Ultrasound pictures of the carotid arteries are used to measure the thickness of the carotid artery walls — to help identify cholesterol plaque buildup. An abnormal thickening of the artery walls may signal the development of cardiovascular disease.
The carotid artery screening solution includes a portable ultrasound scanner and a service for interpreting scan results by technicians.
Technicians were spending a lot of time watching the scan videos with all actions performed manually. With a growing number of doctors using the system, the staff began to underperform, the service price continued to increase, while the assessment speed and the quality were dropping off. The process needed to be optimized.
It turned out that all historical records were stored in reports in different formats and included unnecessary marks on top of the images. It was also hard to identify the exact locations of cholesterol plaque buildup and to measure the artery wall thickness.
Our customer chose to shift the image labeling task to us. The problem was successfully tackled, but it required us to delve much deeper into the subject field. This took quite a while and was fairly costly, definitely more than the customer would have spent if they’d done it on their own.
As a result, the solution automates artery scanning and detects pathologies by processing video streams, saving working hours of healthcare professionals, and saving money for the provider.
Data errors
Plenty of corporate systems contain information submitted manually, which inevitably becomes a source of data errors. The data used for training a Machine Learning model should be error-free. A small number of random errors that don’t follow any pattern won’t really affect the results. But if a decently high number of mistakes follow a certain pattern, the algorithm can learn to reproduce it.
The only solution is manual data processing by experts who are competent enough to catch the errors and not make new ones.
Sometimes handling data errors is such a big undertaking that it becomes a separate project in its own right. This is what happened with a solution we delivered to a maintenance service company.
The company provides a variety of maintenance jobs to its clients, from repairing gas stations to constructing billboards, and much more. Altogether, its services span over 25 categories with up to 200 subcategories in each. Considering the company receives up to 1,000 requests per day, it needs a big team of engineers. In addition, all the jobs have different pricing, and some of them are performed under warranty.
After rendering the requested service, technicians have to fill out a form that includes the type of work performed, which affects the billing. The problem was that the engineers were misclassifying about 15% of tasks. For this reason, the company had to engage a whole team of managers to check the reports.
Managers were more competent and made fewer mistakes, which gave us an opportunity to use their fixes as a baseline to train a model. Eventually, the system was able to detect and fix about 98% of errors, alerting managers when something suspicious was detected and suggesting the options the system considered more probable.
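To illustrate the mechanism, here is a toy Python sketch of such a check: a deliberately crude, keyword-based stand-in for the real classifier predicts a category from the job description and raises an alert when it confidently disagrees with the technician’s entry. The categories, keywords, and threshold are hypothetical.

```python
# Hypothetical categories and keywords; the real system used a model
# trained on managers' corrections, not a keyword lookup.
CATEGORY_KEYWORDS = {
    "gas_station_repair": {"pump", "fuel", "nozzle"},
    "billboard_construction": {"billboard", "banner", "frame"},
}

def predict_category(description):
    """Pick the category whose keywords overlap the description most."""
    words = set(description.lower().split())
    scores = {cat: len(words & kw) for cat, kw in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    total = sum(scores.values()) or 1
    return best, scores[best] / total  # category plus a crude confidence

def review_report(description, reported_category, threshold=0.8):
    """Return an alert with a suggested category, or None if the entry looks fine."""
    predicted, confidence = predict_category(description)
    if predicted != reported_category and confidence >= threshold:
        return {"suggested": predicted, "confidence": confidence}
    return None

alert = review_report("replaced fuel pump nozzle", "billboard_construction")
print(alert)  # the entry is flagged with a more probable category
```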
This drastically reduced the time spent on fixing mistakes.
What if a company’s data is too valuable to share with a third party?
Businesses treat data as a key asset, which we consider to be responsible practice. It is very common that some information is hidden even from employees, let alone a third-party development team. But, can there be any solution for building a prototype without exposing sensitive data?
Actually, when it comes to data security, there is no real obstacle, because anonymized data is perfectly suitable for developing a PoC. We developed a specific approach that we call the "Moon rover": before being sent to the Moon, moon buggies are tested here on Earth under simulated conditions close to the real ones. A similar process is applied to developing a PoC with anonymized data.
If a company is not yet ready to share its databases for building a model, it can provide an anonymized dataset: a set of real but depersonalized data. The installer, prototype, and tester are packed and sent to a remote server, where training and verification are performed and conclusions are drawn.
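As a rough illustration of what preparing such a dataset can look like, the Python sketch below drops direct identifiers and replaces the join key with a salted hash. The field names and the salt are hypothetical; a real project would follow the client’s own anonymization policy.

```python
import hashlib

# Hypothetical direct identifiers to strip from every record.
SENSITIVE = {"name", "email", "phone"}

def pseudonymize(value, salt="project-salt"):
    """Replace an identifier with a stable, irreversible token."""
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:12]

def anonymize_record(record):
    """Drop direct identifiers and pseudonymize the join key."""
    cleaned = {k: v for k, v in record.items() if k not in SENSITIVE}
    cleaned["client_id"] = pseudonymize(cleaned["client_id"])
    return cleaned

record = {"client_id": 1042, "name": "J. Doe", "email": "j@example.com",
          "purchases": 7, "region": "EU"}
print(anonymize_record(record))  # identifiers gone, useful features kept
```

The salted hash keeps records joinable across tables without exposing who the client is.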
This is a win-win approach: the business will not show any private data until it makes sure the model will behave properly in a full-scale project.
As someone who supervises a data-science-based project, you really want to know the basics of getting the most out of your data. It may not seem so challenging to come up with a sensible model when your data is perfect, but in real life that almost never happens. 99% of the time, real data has missing values, noise, outliers, or excessive information: all issues that make it harder to use.
For that reason, data scientists consider data preprocessing the most time-consuming, and therefore expensive, part of a DS-based project. Properly processed, well-laid-out data and efficient, domain-specific features are key success factors for the project.
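As a small taste of what that preprocessing involves, here is a self-contained Python sketch that imputes missing values with the column median and clips outliers to an interquartile-range fence. The 1.5×IQR fence is a conventional default, not a project-specific choice.

```python
def median(xs):
    """Median of a non-empty list of numbers."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def preprocess(column):
    """Fill gaps with the median, then clip values outside the IQR fence."""
    observed = [x for x in column if x is not None]
    med = median(observed)
    filled = [med if x is None else x for x in column]  # impute missing values
    s = sorted(filled)
    q1 = median(s[: len(s) // 2])            # lower-half median
    q3 = median(s[(len(s) + 1) // 2 :])      # upper-half median
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [min(max(x, lo), hi) for x in filled]  # clip outliers to the fence

raw = [3.0, None, 4.0, 5.0, 100.0, 4.5]  # a noisy column with a gap and an outlier
print(preprocess(raw))
```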
Let us tell you more about our projects!