Data Governance is a best practice for the management of an effective analytics program. The work of Bill Inmon and Ralph Kimball pointed out the need to involve all levels of the organization as a component of a successful analytics effort. The LogiXTech team wondered how the principles of Data Governance could be adopted in the era of Big Data and Machine Learning.
Today’s analytics strategies are based on the reality that large volumes of data can be readily obtained from multiple sources. The volume of data, and the rate at which it is generated, are greater than the pioneers of Data Governance could have imagined. Traditional approaches to data staging and cleansing become bottlenecks when implementing a big data analytics strategy.
To gain insight into Data Governance as it relates to Big Data and Machine Learning, we conducted a survey of the literature from thought leaders in this area. The principles of Data Governance are still relevant in the current era of Big Data and Machine Learning; however, the approaches to implementing them have changed. In this post we summarize the recommendations and best practices from industry leaders.
The Benefits of Data Governance
Data Governance supports strategic analytics initiatives. Unlike transactional or operational reporting, analytics efforts are designed to measure performance and progress towards strategic goals and initiatives.
Data Governance consists of a series of interrelated processes. These processes bring together people, processes, and technology with the aim of creating a shared understanding of how data is acquired, managed, and used across the organization. The benefits of Data Governance include:
- Ensured compliance with key internal and external regulations for the use and protection of consumer and other identifiable data
- Standardized data systems, policies, procedures, and standards through a clear understanding of where and how data is acquired and incorporated into the analytics program
- Decreased costs related to data management
- Increased transparency for data activities through an understanding of data sources and data acquisition processes
- Increased overall revenue through fewer errors and better measurement of the organization’s performance
- Improved operational efficiency owing to a better capacity to plan
- Improved data quality
Data Governance Participants
Effective Data Governance strategies require participation from all levels of the organization. Whether we are talking about a data warehouse project, a big data effort, or a machine learning effort, participation across the organization is a fundamental requirement.
Common roles include:
- Executive Stakeholders: Responsibilities include creating the vision and outlining the metrics that will be measured and evaluated by the effort
- Data Architect: The Architect is a senior technical staff member who has a broad understanding of the internal and external data sources, the capabilities of the systems, and how data can be brought together to support the effort
- Data Owners: Departmental leaders who understand the data generated through the day-to-day interactions and operations of their business unit
- Data Stewards: Departmental members who are subject matter experts in how data is processed and stored in operational systems
- Technical Data Stewards: IT Staff who work with the Data Owners and the Data Stewards to acquire and aggregate data for the analytics effort
AI’s Role in Data Quality
Traditionally, data quality processes depended on the manual creation of programs to address known data issues. Through the Data Steward’s expert knowledge, common gaps or errors in the source system’s data were identified and addressed through data cleansing programs. Exceptions to the defined process required additional manual intervention to refine and adapt the data quality process.
The volume of data generated by interactions with a website, or by connected Internet of Things devices, makes this approach impractical. In a machine learning solution, algorithms can be created to address and mitigate data quality issues. Algorithms can be trained to identify common data elements among data sets, classify and group data acquired from multiple systems, and correct incomplete data based on patterns identified in the data.
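As an illustration of the last idea, the sketch below learns patterns from complete records and uses them to fill gaps in incomplete ones. It is a minimal, hypothetical example (the field names `device` and `region` and the mode-based rule are our own, not drawn from the literature reviewed), but it captures the principle of correcting incomplete data from patterns in the data itself:

```python
from collections import Counter, defaultdict

def learn_patterns(records, key_field, target_field):
    """Learn the most common target_field value for each key_field value,
    using only records where both fields are present."""
    observed = defaultdict(Counter)
    for r in records:
        if r.get(key_field) and r.get(target_field):
            observed[r[key_field]][r[target_field]] += 1
    return {k: counts.most_common(1)[0][0] for k, counts in observed.items()}

def fill_missing(records, key_field, target_field, patterns):
    """Fill missing target_field values from the learned patterns;
    records with no applicable pattern are returned for human review."""
    unresolved = []
    for r in records:
        if not r.get(target_field):
            guess = patterns.get(r.get(key_field))
            if guess:
                r[target_field] = guess
            else:
                unresolved.append(r)
    return records, unresolved

# Hypothetical telemetry feed with an occasionally missing 'region' field
records = [
    {"device": "sensor-a", "region": "us-east"},
    {"device": "sensor-a", "region": "us-east"},
    {"device": "sensor-a", "region": None},
    {"device": "sensor-b", "region": None},
]
patterns = learn_patterns(records, "device", "region")
fixed, unresolved = fill_missing(records, "device", "region", patterns)
```

In a production setting a trained classifier would replace the simple frequency rule, but the shape of the process is the same: learn from the clean portion of the data, apply to the incomplete portion, and surface what cannot be resolved.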
The recommendations from our review of the literature describe a stepwise approach to developing machine learning for data management and data quality:
- Think big but start small: Identify a single business case and begin building processes around the business case. The lessons learned from managing data for the initial business case can be applied as the scope of the effort expands
- Iteratively increase the autonomy of machine-managed data quality processes:
- Monitor: Humans are actively involved in sampling and monitoring the effectiveness of the data management processes, and in creating structure for unstructured data
- Coaching: The accuracy of the data quality algorithm is monitored and scored by humans. The human accuracy scores are used to enhance the data management algorithms
- Collaborating: The algorithm takes increased responsibility for managing, correcting, and structuring the data. Humans are involved when the process detects elements or issues it does not understand
- Autonomy: The algorithm can confidently address data management and data quality issues. Monitors and alerts allow humans to assess the effectiveness of the processes
- Develop Key Performance Indicators (KPIs) associated with the business case. Metrics are essential to tracking progress towards the strategic goal, and regular review of the KPIs identifies ways to improve their measurement in the future
- Develop a communication plan. Regularly report progress and the results of the analytics effort to the stakeholders and data owners. Regular communication keeps the organization engaged with the effort and provides additional input for refining KPIs and metrics as the effort moves forward
- Create an awareness that this is an ongoing effort. A common pitfall of data governance efforts is the participants’ perception that this is a one-time project. The communications must establish that this is a process shift that will continue into the future
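The middle rungs of the autonomy ladder above (coaching and collaborating) can be sketched as a confidence-based triage: the algorithm applies corrections it is confident about and escalates the rest to humans. The scorer, threshold value, and country-code lookup below are hypothetical examples of ours, not a prescription from the literature:

```python
def triage(record, score_fn, auto_threshold=0.9):
    """Route a proposed data-quality correction: apply it automatically when
    the model's confidence clears the threshold, otherwise queue the record
    for human review. score_fn returns (proposed_record, confidence)."""
    corrected, confidence = score_fn(record)
    if confidence >= auto_threshold:
        return ("auto", corrected)
    return ("human_review", record)

# Hypothetical scorer: normalizes country names using a fixed lookup table;
# anything outside the table gets a low confidence score.
KNOWN = {"USA": "US", "U.S.": "US", "United States": "US"}

def score_country(record):
    raw = record.get("country", "")
    if raw in KNOWN:
        return {**record, "country": KNOWN[raw]}, 0.99
    return record, 0.2  # unfamiliar value: defer to a human

action, result = triage({"country": "U.S."}, score_country)
```

As the human accuracy scores accumulate during the coaching stage, the threshold can be raised or lowered and the scorer retrained, gradually moving the process toward the autonomy stage.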
Drs. Gasser and Almeida of Harvard University advocate creating a layered model for the governance of autonomously operating machine learning and AI algorithms. The layers include:
- Social Considerations: societal norms and legal constraints on the use of data by the machine learner
- Ethical Considerations: organizational definitions of the criteria for the appropriate use of data
- Data Governance: in this model, the foundational component that supports the application of the social and ethical constraints on the use of data by Machine Learning or AI processes
Big Data, Machine Learning and AI can create new levels of insight and efficiency for organizations. An effective Data Governance program is an essential component for the successful application of these technologies to an analytics strategy.