Achieving high-quality data is crucial for gaining competitive advantages through data insights. Poor data quality, such as dirty data with missing or invalid values, can severely hinder the reliability of insights derived from predictive algorithms. To combat this, companies need to implement a blend of techniques for data governance to manage data quality effectively.
Key Techniques for Data Quality:
The best approach to ensure data quality involves a combination of three methods: decoded lineage, data similarity lineage, and manual lineage mapping.
- Decoded Lineage:
- Overview: This approach focuses on analyzing the code used to process data rather than the data itself. It captures all logic behind data processing and helps understand how data changes or flows through systems.
- Strengths:
- Provides the most accurate and detailed lineage metadata.
- Captures every piece of logic, enabling an understanding of how data evolves.
- Weaknesses:
- It may struggle with supporting multiple programming languages or new technologies.
- Dynamic code (created based on input, data, or variables) is difficult to track.
- Manual database changes (e.g., SQL statements by admins) won’t be detected.
- Inappropriate data handling by code (e.g., violating GDPR) won’t raise alerts.
- Lineage Based on Data Similarity:
- Overview: This method examines the data in databases directly, profiling tables, columns, and using metadata to create lineage based on data similarities.
- Strengths:
- Works regardless of the technology used, since it analyzes the resulting data.
- Weaknesses:
- Requires significant computing resources to identify similarities in large databases.
- Misses details such as transformation logic.
- Can’t track data lineage before processing (e.g., pre-annual processes like revenue adjustments).
- Manual Lineage Mapping:
- Overview: This technique involves capturing lineage metadata through human knowledge, such as interviews with application owners or data stewards.
- Strengths:
- Helps map out prescribed data flow, showing the intended process.
- Weaknesses:
- May conflict if there is disagreement between individuals or be incomplete if not all data flows are known.
Benefits of Lineage Metadata:
Lineage metadata allows companies to:
- Perform impact and root cause analysis by tracing the origin of data and identifying the affected data.
- Standardize business terminology for clearer communication across departments.
- Ensure ownership, accountability, and traceability of changes made to data.
Combining Techniques for Comprehensive Data Lineage:
Since each lineage method has its strengths and weaknesses, a combination of all three techniques is recommended for optimal data quality management:
- Start with decoded lineage (e.g., tools like MANTA, Octopai, or Spline) to capture processing logic and detailed lineage metadata.
- Add lineage based on data similarity using tools like Tamr, Paxata, or Trifacta to identify data patterns and profiles.
- Complete the process with manual mappings to capture intended data flows and governance rules.
By integrating these methods, organizations can ensure comprehensive lineage metadata that will support data governance efforts and help maintain high data quality standards.