In today’s rapidly transforming world, data has emerged as a key enabler for dealing with current business challenges and future opportunities. By 2024, 75% of organizations are expected to establish a centralized data and analytics center of excellence to support federated Data & Analytics initiatives and prevent enterprise failure, according to Gartner. This article examines the common data quality challenges faced by enterprises and highlights the contributions of Cigniti Technologies, a leading AI & IP-led Digital Assurance and Digital Engineering Services company, in addressing these issues.
Our guide through this exploration is Srinivas Atreya, Chief Data Scientist at Cigniti Technologies. Srini shares his insights on the impact of data quality on AI effectiveness, strategies to mitigate biases, and ways to improve data quality for optimal AI results.
Gary Drenik: What are the typical data quality issues that Enterprises face?
Srinivas Atreya: There are several data quality issues that enterprises commonly face. These issues can hamper the decision-making process, lead to incorrect analysis, and negatively affect the operations of the enterprise. Here are some of the most common data quality problems:
- Inconsistency: This refers to data that doesn’t align across different systems. For example, one system may list a customer’s address differently than another.
- Inaccuracy: Inaccurate data can come from many sources, including human error, system glitches, or data transfer problems. This can include things like incorrect customer details or incorrect financial figures.
- Incompleteness: Missing data is a common issue in many enterprises. This can happen for a variety of reasons, including data not being collected, data being lost in transfer, or data being accidentally deleted.
- Duplication: Duplicate data entries can cause a lot of confusion and lead to incorrect analysis. This often happens when data is merged from different sources without proper checks for duplicates (the sketch after this list shows a simple de-duplication and standardization step).
- Outdated Information: Data can become outdated quickly, especially in fast-moving industries. This can lead to decisions being made based on old information, which can have negative consequences.
- Irrelevancy: Not all data collected is useful or relevant. This irrelevant data can clutter databases and make it harder to find the information that is actually needed.
- Poor Data Governance: Without proper rules and processes in place for handling data, many data quality issues can arise. This can include things like not having a standard format for data entry or not having procedures in place for checking data accuracy.
- Lack of data standardization: When data is entered in different formats, it can lead to problems when trying to analyze or use that data. For example, dates might be entered in different formats in different systems, making it difficult to compare or merge the data.
- Lack of data integrity: This happens when the relational rules that exist between data sets are not maintained. For example, if a database has a rule that every order must be associated with a customer, but an order is entered without a customer, this would be a data integrity issue.
- Data security issues: This is not a quality issue per se, but if data is not secure and breaches occur, the quality of data can be compromised. According to a recent Prosper Insights & Analytics survey, 13.5% of individuals aged 18 and above are not likely to give their personal information, such as a phone number or email address, when a store associate asks for it at the register. These concerns are driven by data security issues.
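To make the duplication and standardization issues concrete, here is a minimal sketch in Python, assuming pandas; the column names, sample records, and the pandas ≥ 2.0 `format="mixed"` option are illustrative assumptions, not drawn from any specific client system:

```python
# A minimal sketch of two common clean-up steps: removing duplicate
# records and standardizing inconsistent date formats with pandas.
# Column names and sample values are illustrative.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "signup_date": ["2023-01-05", "2023-01-05", "January 5, 2023", "5 Jan 2023"],
})

# Duplication: drop exact repeats of the same record.
df = df.drop_duplicates()

# Standardization: parse the mixed date formats into one canonical form.
# format="mixed" (pandas >= 2.0) infers the format element by element.
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed").dt.date

print(df)
```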
Addressing these data quality issues often involves both technological solutions, like data management and cleaning software, as well as organizational changes, like implementing data governance policies and training staff on data handling best practices.
We at Cigniti Technologies are at the forefront of this, helping our clients with technology solutions ranging from building lakehouses to managing end-to-end data-movement pipelines. We also help our clients with Data Governance & Master Data Management solutions to further enhance the quality of their internal datasets.
Drenik: How can AI help with improving data quality in Enterprises?
Atreya: AI has significant potential to improve data quality within enterprises. Here’s how it can help:
- Automated Data Cleaning: AI can automate the process of data cleaning, including identifying and removing duplicates, filling in missing values based on learned patterns, and correcting inconsistent or inaccurate data. Machine learning algorithms can be trained to recognize errors that are often hard to spot with manual review.
- Anomaly Detection: Machine learning models can be used to identify outliers or anomalies in the data that could indicate errors. For instance, a model could be trained to identify transactions that fall outside the normal range for a business, which could indicate fraudulent activity or data entry errors (the first sketch after this list illustrates the idea).
- Data Validation: AI can help to ensure that the data collected is accurate and relevant. Machine learning models can be trained to validate data input based on defined criteria and flag potential issues for review.
- Data Standardization: AI can automate the process of transforming data into a common format. This can help to ensure consistency and improve the accuracy of data analysis.
- Predictive Analytics: AI can be used to predict missing or unknown values in a dataset based on patterns in the data, improving the completeness and accuracy of the data (the second sketch after this list shows a simple form of this).
- Natural Language Processing (NLP): AI can use NLP to extract valuable information from unstructured data like emails, customer reviews, and social media posts. This can greatly enhance the quality and depth of data available for analysis.
- Data Governance: AI can help to enforce data governance rules automatically, ensuring that all data comply with set standards and policies.
- Data Security: AI can monitor and detect unusual activities in real-time, enhancing the security of data and preventing unauthorized access or breaches. This is especially relevant in light of recent findings from the Prosper Insights & Analytics survey, which revealed 21.6% of individuals aged 18 and above are very concerned about the privacy of their personal information when shopping in a store.
- Metadata Management: AI can help in creating, managing, and maintaining metadata, which is crucial for data understanding, categorization, and utilization.
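To illustrate the anomaly detection point, here is a minimal sketch assuming scikit-learn’s IsolationForest (the interview does not name a specific algorithm); the simulated transaction amounts and the 1% contamination rate are illustrative:

```python
# A minimal sketch of ML-based anomaly detection on transaction amounts.
# The data is simulated; a real pipeline would use vetted historical records.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
amounts = np.concatenate([
    rng.normal(loc=100, scale=15, size=500),  # typical transactions
    np.array([10_000.0, -250.0, 9_500.0]),    # planted data-entry errors
]).reshape(-1, 1)

# contamination is the assumed share of anomalies (illustrative here).
model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(amounts)  # -1 = anomaly, 1 = normal

flagged = amounts[labels == -1].ravel()
print(f"Flagged {flagged.size} records for review")
```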
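And for the predictive analytics point, a minimal sketch of pattern-based imputation, assuming scikit-learn’s KNNImputer; the feature names and values are hypothetical:

```python
# A minimal sketch of filling missing values from the most similar rows.
# Feature layout: [age, annual_income_k, credit_score]; np.nan = missing.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [25, 40, 650],
    [27, 42, np.nan],   # missing credit score
    [52, 95, 780],
    [50, np.nan, 770],  # missing income
])

# Each gap is filled using the average of its 2 nearest neighbors.
imputer = KNNImputer(n_neighbors=2)
X_complete = imputer.fit_transform(X)
print(X_complete)
```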
By implementing AI solutions, enterprises can not only improve the quality of their data but also free up valuable time and resources that can be better spent on analysis and decision-making. However, AI solutions should be carefully managed and monitored to ensure they are working correctly and not introducing new errors in the data.
We at Cigniti Technologies work with a large bank in North America, helping them with ML-based data validation techniques as part of their credit-approval pipelines. This is crucial to ensure that the downstream credit-approval ML models work on accurate datasets.
We work with another large US retailer helping them use ML routines that can automatically detect data anomalies before they get into downstream reporting applications. This has improved the overall data governance policies as well as helped IT gain the trust of their business counterparts.
Drenik: How does the quality of data influence the effectiveness of AI systems?
Atreya: The quality of data is crucial to the effectiveness of AI systems. If the data used to train an AI model is inaccurate, incomplete, inconsistent, or biased, the model’s predictions and decisions will be too. High-quality data results in AI systems being able to make more accurate predictions, provide relevant recommendations, and effectively automate processes. Therefore, maintaining data quality is crucial for achieving the best possible outcomes from AI systems.
One assumption many ML practitioners make is that using “Big Data” can cover up the problems caused by bad data quality. This is never true; in fact, data quality issues in larger datasets tend to create even worse outcomes for ML models. A simple model trained on good-quality data will consistently outperform a complex model trained on “Big Data” of dubious quality.
Drenik: Can poor data quality lead to biased AI decisions?
Atreya: Yes, poor data quality can indeed lead to biased AI decisions. Bias in AI outcomes often stems from biased training data. If the data used to train an AI system has inherent biases, the AI will likely reproduce and potentially amplify them. This has serious implications in fields such as hiring, lending, and law enforcement, where AI is increasingly used for decision-making. Therefore, it’s essential to ensure the data used is representative of the real world and free from discriminatory biases.
This is especially true of protected attributes like gender & ethnicity. Care should be taken never to use these features directly in the ML model training process; as a best practice, even surrogate markers (proxies such as postal code that correlate strongly with a protected attribute) should be avoided.
Statistical parity tests should always be conducted to ensure that the model decisions are not unfairly influenced by these protected features.
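For illustration, here is a minimal sketch of such a check, assuming pandas; the decision data is made up, and the 0.1 gap threshold is a common rule of thumb rather than a universal standard:

```python
# A minimal statistical parity (demographic parity) check: compare
# approval rates across groups and flag a material gap for review.
import pandas as pd

decisions = pd.DataFrame({
    "group":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "approved": [1,   1,   0,   1,   1,   0,   0,   0],
})

rates = decisions.groupby("group")["approved"].mean()
gap = rates.max() - rates.min()
print(rates)
print(f"Statistical parity gap: {gap:.2f}")

if gap > 0.1:  # illustrative threshold; tune per domain and regulation
    print("Warning: approval rates differ materially across groups.")
```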
Drenik: What strategies can be used to improve data quality for better AI outcomes?
Atreya: There are several strategies to improve data quality for better AI outcomes. Firstly, you can employ data-cleaning techniques to remove errors, duplicates, and inconsistencies. Secondly, it’s crucial to ensure data is complete and does not contain missing values that can skew the model’s understanding. Thirdly, you can use techniques like data augmentation to enhance the diversity and representativeness of your data.
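As one small example of improving representativeness, here is a minimal sketch of rebalancing a scarce class by resampling, assuming pandas and scikit-learn; the data is made up, and synthetic augmentation methods such as SMOTE would go a step further:

```python
# A minimal sketch of upsampling an underrepresented class so the
# training data better represents both outcomes.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "feature": range(10),
    "label":   [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],  # class 1 is scarce
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Sample the minority class with replacement to match the majority size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["label"].value_counts())
```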
The human aspect of data should never be neglected. All data is ultimately a transcription of human memories, so it should never be assumed to be correct or infallible. An insightful perspective from a recent Prosper Insights & Analytics survey is that 47.8% of individuals aged 18 and above have denied mobile apps permission to track them, showcasing the growing concern for digital privacy. It is important to understand the motivations and historical biases behind any process that generates a dataset of interest. Multiple human perspectives are needed to ensure that a dataset used to train ML models does not propagate or amplify existing societal biases into the future.
Finally, implementing proper data governance strategies, which include establishing clear procedures for data collection, storage, and use, can help maintain data quality over time.
Drenik: Thank you, Srinivas, for your valuable insights into the typical data quality issues faced by enterprises and also for sharing the impactful work that Cigniti Technologies is doing to address these challenges in collaboration with clients. Your perspective on data governance strategies, including clear procedures, provides a practical approach for sustaining data quality, ultimately enhancing AI outcomes and fostering trust in the digital realm.