In our recent webinar, “Data Science Essentials for Startups: Where to Start and How To Prioritize,” we discussed how startups use data in their applications. Facial recognition, robotics, and other cutting-edge AI technologies rely on large data sets to make better decisions faster. However, many startups are so eager to implement data-driven solutions that they overlook an essential factor: data quality.
The quality of data directly affects the efficacy and agility of data science applications. Even a large data set may not be usable, depending on how it was collected and what information you can extract from it. Rather than let you waste time and resources, we aim to create data solutions built on data that a model can actually learn from. That is the focus of Echelon DS and the driver behind each of our collaborative projects with startups.
Are you using quality data?
Many startups focus so intensely on data quantity that they don’t think to determine whether their data sets are usable or not. To determine whether your data is of the quality necessary to help you make informed decisions, you must consider the seven factors below.
We will use the example of a data set of facial images to show how quality data often matters more than sheer quantity. If you were building a program to detect facial expressions, you would need to check your images, and the data set as a whole, against each of these factors.
- Fidelity – From the point of collection, the data needs to contain clearly identifiable features that are consistent with what you’re attempting to model. For example, if you’re collecting images for facial expression detection, blurry or warped images would be of little use.
- Quantity – The advantage of big data is identifying and then analyzing or predicting trends within the set. If you only have 50 images to build a model to detect facial expressions, you likely don’t have enough to detect trends or variance.
- Variance – If you’re trying to detect generic facial expressions, 1,000 images of the same person would not contain enough variance to apply to multiple users. Variance also prepares AI models to adapt to different environments and handle complex inputs.
- Bias – Implicit bias in your data sets can skew results and give you an inaccurate picture of how your product or service will work in the world. For example, if you’re building facial expression detection but your images don’t include multiple ethnicities or sexes, you may be training your AI to be less effective for certain users.
- Consistency – Data needs to be varied to eliminate implicit bias, but it may also need to be segmented to accelerate learning. For instance, the facial features and dynamics of a child can be quite different from those of an adult. Facial expression detection would therefore need to draw on a variety of data sets, each internally consistent, and keep variables such as age group comparable across them.
- Feature-rich – Data that limits machine learning will only impede your progress. As an illustration, only using black and white or close-up images may limit facial expression detection capability.
- Embedding assumptions – On the flip side, you might exclude certain images because you have assumed they are low quality. However, they could potentially add to the variance and might not impede machine learning at all. For instance, black and white photos may not hinder facial expression recognition, yet if you exclude them, you could be losing out on valuable information.
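Several of these factors (quantity, variance, and bias in particular) can be checked with simple counts before any model is trained. The following is a minimal Python sketch, not a definitive audit. It assumes each image has a metadata record with the hypothetical fields `subject_id` and `group`; those field names and all thresholds are illustrative assumptions, not recommendations.

```python
from collections import Counter

def audit_dataset(records, min_images=1000, min_subject_ratio=0.1,
                  min_group_share=0.10):
    """Return a list of warnings about quantity, variance, and bias.

    `records` is a list of dicts with the hypothetical keys
    'subject_id' and 'group'. All thresholds are illustrative only.
    """
    warnings = []
    total = len(records)

    # Quantity: too few images to detect trends or variance.
    if total < min_images:
        warnings.append(f"quantity: only {total} images")

    # Variance: many images of very few people generalize poorly.
    subjects = {r["subject_id"] for r in records}
    if total and len(subjects) / total < min_subject_ratio:
        warnings.append(
            f"variance: {len(subjects)} distinct subjects in {total} images"
        )

    # Bias: flag any group that makes up only a small share of the set.
    group_counts = Counter(r["group"] for r in records)
    for group, count in sorted(group_counts.items()):
        if count / total < min_group_share:
            warnings.append(
                f"bias: group '{group}' is {count / total:.0%} of the data"
            )

    return warnings

# 1,000 images of the same person: enough images, but no variance.
same_person = [{"subject_id": 1, "group": "adult"}] * 1000
print(audit_dataset(same_person))
```

A check like this only covers what the metadata records. Fidelity and feature richness still require looking at the images themselves.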
These qualities are essential in finding data sets that will accelerate learning for your startup, but acquiring quality data can be trickier than many startups realize. Therefore, you need to ask yourself these questions before you begin designing your data solution, to ensure your hard work will give you the results you want:
How much data do you need?
In the facial expression recognition example, you need to know how many images it takes to create an accurate analytics tool. Is 5,000 enough, or do you need 50,000?
Does it contain the features you require?
If you’re trying to detect facial expressions accurately but half or more of your images show only happy or neutral faces, you won’t be able to reliably detect the full range of expressions, especially the more subtle ones. Likewise, using only images of eyes or mouths may limit what your model can learn and reduce the efficacy of your analysis.
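One way to check this is to compare the labels in your data set against the expressions you need to detect. The sketch below is an illustration under assumptions: the function name `expression_coverage`, the label names, and the 5% minimum share are all hypothetical choices, and it presumes your images already carry expression labels.

```python
from collections import Counter

def expression_coverage(labels, required, min_share=0.05):
    """Return required expressions that are missing or underrepresented.

    Maps each such expression to its share of the data set.
    The `min_share` threshold is an illustrative assumption.
    """
    counts = Counter(labels)
    total = len(labels)
    report = {}
    for expression in required:
        share = counts[expression] / total if total else 0.0
        if share < min_share:
            report[expression] = share
    return report

# A data set dominated by happy and neutral faces.
labels = ["happy"] * 50 + ["neutral"] * 48 + ["surprised"] * 2
required = ["happy", "neutral", "surprised", "angry"]
print(expression_coverage(labels, required))
```

Here the report would flag "surprised" as underrepresented and "angry" as missing entirely, which tells you what to collect next.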
Do you have the resources?
The cost of capturing accurate, quality data can be a limiting factor for many startups. However, even if you have the budget, you should go through the checklist to ensure the data you’re using will actually serve the problem you’re trying to solve. Data that fails these checks may only hinder your success.
Are you allowed to capture this data without penalties?
Due to data governance laws, there may be certain types of data you cannot collect or share freely. For example, in the United States, HIPAA restricts the disclosure of certain medical information, including for research purposes. The EU’s General Data Protection Regulation (GDPR) also provides a right to erasure, which means you need systems in place to remove individuals from data sets. Failure to comply can result in substantial fines; under GDPR, the most serious violations can be fined up to €20 million or 4% of global annual turnover, whichever is higher.
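As a rough illustration of what “systems in place to remove individuals” can mean at the data set level, here is a minimal Python sketch. It assumes every record carries a hypothetical `subject_id` field that ties it to a person. Without such a link, honoring an erasure request becomes much harder. A real process would also need to cover backups, copies, and anything derived from the data, which this sketch does not attempt.

```python
def erase_subject(records, subject_id):
    """Drop every record belonging to one person.

    Returns (kept_records, removed_count). The 'subject_id' key is a
    hypothetical field linking each record to an individual.
    """
    kept = [r for r in records if r["subject_id"] != subject_id]
    return kept, len(records) - len(kept)

records = [
    {"subject_id": "a17", "file": "img_001.png"},
    {"subject_id": "b42", "file": "img_002.png"},
    {"subject_id": "a17", "file": "img_003.png"},
]
kept, removed = erase_subject(records, "a17")
print(f"removed {removed} records, {len(kept)} remain")
```

Returning the count of removed records gives you something to log as evidence that the request was handled.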
Data quality is the #1 differentiator between startups that move forward and startups that are stuck in place. Contact us to build more accurate data solutions at an affordable price.