Devops teams build their infrastructure as code, automate deployments with continuous integration/continuous delivery (CI/CD), and establish continuous testing as some of the steps to avoid technical debt. Too much technical debt smells like rotting cheese and slows down agile development teams seeking to deliver features and improve application reliability.
“In small amounts, technical debt is useful because it lets you shift focus to urgent things, but you must pay your debts or risk them growing too large,” says Marko Anastasov, cofounder of Semaphore CI/CD.
Data engineering teams looking to improve dataops and data governance should reduce technical debt in their code and automations, while data scientists should evaluate their machine learning models and other analytics code.
Reducing code-level technical debt is not sufficient for data and analytics teams. They must also address data debt by:
- Reducing duplicate data
- Improving data quality
- Identifying dark data sources
- Centralizing master data
- Resolving data security issues
Like technical debt, data debt is easier to identify after its creation. Data debt often requires teams to refactor or remediate the issues before building data pipeline improvements or new analytics capabilities. Implementing best practices that minimize new data debt is harder, especially when teams can’t predict all the future analytics, dashboarding, and machine learning use cases.
Michel Tricot, cofounder and CEO of Airbyte, says, “Debt is not bad. However, debt needs to be repaid, which should be the focus because important decisions will be made with the data.”
Here are six steps data teams can focus on that help avoid or reduce data debt risks.
1. Incorporate governance into analytics capabilities
Devops teams know that addressing code quality, defects, and security issues is much harder once they’ve developed the code, so they seek to shift-left security and quality assurance practices. Similarly, dataops engineers and data scientists should shift-left data governance practices and instill them while building or updating data pipelines, analytics, and ML models.
Joseph Rutakangwa, cofounder and CEO of Rwazi, says having data governance technologies in place can help. “Data catalogs, data lineage tools, and metadata management systems can help organizations manage and track data sources, data models, and data lineage, which can reduce the risk of data debt,” he says. “Data quality tools, such as data profiling and data cleansing tools, can help identify and address issues with data quality, which can help to prevent the introduction of poor-quality data into the data model and reduce the risk of data debt.”
Having technologies in place helps, but data teams must also instill best practices. Michael Drogalis, principal technologist at Confluent, recommends “consciously choosing access patterns, maintaining governance, building in versioning, and distinguishing the source-of-truth data versus derived data.”
Sasha Grujicic, president of NowVertical, adds solutions such as “standardizing data visualizations, removing unused reports, defining data definitions, implementing data catalogs that alert teams when things need documentation, and instituting data quality procedures.”
2. Assign governance to data and analytics teams
Providing agile data teams with data governance technologies and teaching them the best practices is a step in the right direction. Team members must also understand their roles and responsibilities around tech and data debt to manage a process of continuous improvement.
Rutakangwa recommends, “Designate data stewardship roles, such as data architects, data analysts, and data engineers.” He says, “Assigning roles helps to maintain data models, ensure data is accurate, and address issues to minimize data debt.”
Grujicic adds, “Organizations can identify and outline the proper data governance structure by adopting a top-down strategy and building a scalable system to support current and future inputs. For most companies, decreasing data debt will reduce risk, lower costs, increase productivity, and establish a foundation for growth for years to come.”
3. Establish trust metrics to drive debt remediations
Data teams focused on addressing data debt should aim to improve trust, so that employees who review the data can rely on its accuracy and quality. Tricot says, “Determine the level of trust you have in the data using cataloging tools and looking at how many data explorations and production reports rely on specific pieces of data.”
Higher usage levels can indicate trust, but they’re not the whole story. Dataops and governance teams should measure data quality using accuracy, completeness, consistency, timeliness, uniqueness, and validity metrics. Data leaders should also consider surveying leaders and users and developing a data satisfaction score around how well they trust the data, reports, and predictions.
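As a rough illustration, a few of these metrics can be computed with nothing more than the Python standard library. This is a minimal sketch, not a prescribed method: the sample records, the field names (`email`, `updated`), the email pattern, and the 30-day freshness window are all assumptions made for the example.

```python
import re
from datetime import datetime, timedelta

def non_empty(rows, field):
    """Values of a field that are present and not blank."""
    return [r[field] for r in rows if r.get(field) not in (None, "")]

def completeness(rows, field):
    """Share of rows where the field is filled in."""
    return len(non_empty(rows, field)) / len(rows) if rows else 0.0

def uniqueness(rows, field):
    """Share of filled-in values that are distinct."""
    values = non_empty(rows, field)
    return len(set(values)) / len(values) if values else 0.0

def validity(rows, field, pattern):
    """Share of filled-in values that match a regular expression."""
    values = non_empty(rows, field)
    if not values:
        return 0.0
    return sum(1 for v in values if re.fullmatch(pattern, v)) / len(values)

def timeliness(rows, field, max_age, now):
    """Share of rows updated within max_age of the reference time."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if now - r[field] <= max_age) / len(rows)

# Hypothetical customer records, invented for this example.
rows = [
    {"id": 1, "email": "a@example.com", "updated": datetime(2024, 1, 10)},
    {"id": 2, "email": "", "updated": datetime(2024, 1, 9)},
    {"id": 3, "email": "a@example.com", "updated": datetime(2023, 6, 1)},
    {"id": 4, "email": "not-an-email", "updated": datetime(2024, 1, 8)},
]
EMAIL = r"[^@\s]+@[^@\s]+\.[^@\s]+"  # deliberately loose, illustrative pattern

report = {
    "completeness": completeness(rows, "email"),
    "uniqueness": uniqueness(rows, "email"),
    "validity": validity(rows, "email", EMAIL),
    "timeliness": timeliness(rows, "updated", timedelta(days=30),
                             now=datetime(2024, 1, 11)),
}
print(report)
```

Accuracy and consistency are harder to score in isolation because they usually require a trusted reference source or a second data set to compare against.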
4. Implement data lineage and observability
Low usage, poor data quality, or underwhelming data satisfaction metrics strongly indicate that data debt may undermine how leaders use the data for decision-making. When there’s low trust, dataops teams must work backward to understand the data lineage and how data changes from source to destination. One way to shift-left data lineage is by implementing data observability into every step of the data process.
“Data observability is when you know the state and status of your data across the entire life cycle,” says Grant Fritchey, devops advocate at Redgate Software. “Build this kind of observability when you set up a dataops process to know if and where something has gone wrong and what’s needed to fix it.” Fritchey also says that data observability helps communicate data flows to business users and establishes an audit trail to support debugging and compliance audits.
Jeff Foster, director of technology and innovation at Redgate Software, adds, “Data observability helps engineers by putting guardrails in place, so data ends up being used in a compliant and ethical way. As we build ever more sophisticated AI/ML pipelines, dataops will be of increasing importance as we seek to understand the data sources used to build large-scale machine learning models.”
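To make the idea of observability at every step more concrete, here is a small sketch of a pipeline runner that records how many rows enter and leave each step, producing a simple audit trail. The step names, the sample data, and the 50% loss threshold are assumptions for illustration; dedicated observability tools track far more than row counts.

```python
from dataclasses import dataclass

@dataclass
class StepRecord:
    step: str
    rows_in: int
    rows_out: int

def run_pipeline(rows, steps):
    """Run (name, function) steps in order and log row counts for each."""
    audit = []
    for name, fn in steps:
        before = len(rows)
        rows = fn(rows)
        audit.append(StepRecord(name, before, len(rows)))
    return rows, audit

def suspicious_steps(audit, max_loss=0.5):
    """Names of steps that dropped more than max_loss of their input rows."""
    return [
        r.step for r in audit
        if r.rows_in and (r.rows_in - r.rows_out) / r.rows_in > max_loss
    ]

def drop_empty(rows):
    """Remove rows with a blank amount."""
    return [r for r in rows if r["amount"]]

def parse_amount(rows):
    """Keep rows whose amount is a whole number and convert it to int."""
    return [{**r, "amount": int(r["amount"])}
            for r in rows if r["amount"].isdigit()]

def only_large(rows):
    """Keep rows with an amount above 100."""
    return [r for r in rows if r["amount"] > 100]

# Hypothetical source rows, invented for this example.
raw = [
    {"id": 1, "amount": "10"},
    {"id": 2, "amount": ""},
    {"id": 3, "amount": "x"},
    {"id": 4, "amount": "7"},
]
steps = [("drop_empty", drop_empty),
         ("parse_amount", parse_amount),
         ("only_large", only_large)]

result, audit = run_pipeline(raw, steps)
for record in audit:
    print(record)
print("check these steps first:", suspicious_steps(audit))
```

When a report downstream looks wrong, a log like this lets the team work backward from destination to source and see which step changed the data the most.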
5. Beware of data locked into closed systems
Part of data debt is data systems debt, caused when the underlying data management platforms aren’t meeting the business needs. Erik Bledsoe, content marketing manager at Calyptia, says, “Data is irrelevant until it isn’t, and then it is crucial. That’s why you need to be able to process your data, store what is currently relevant in the appropriate back ends, and then route the rest to low-cost storage solutions where it can be retrieved for future analysis.”
Bledsoe recommends seeking vendor-neutral tools supported by open standards. He warns, “Data that can only be accessed by an app you stopped using three years ago is just as bad as not having the data to begin with, and may be even worse since your data is essentially being held hostage.”
One way to avoid lock-in is to automate data extractions from SaaS and other applications and use centralized data platforms such as data lakes or data warehouses for reporting and analytics use cases. Centralized data platforms can also be the source for any platform migration. Archiving older data helps meet compliance requirements without overwhelming data visualization and analytics tools with more data than required.
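A simple version of this extract-and-archive pattern might look like the sketch below. It assumes records have already been pulled from an application as dictionaries with an ISO-format `created` date; the field names, the cutoff date, and the choice of JSON Lines as the open output format are illustrative assumptions.

```python
import json
from datetime import date

def split_for_archive(records, cutoff):
    """Separate records created on or after the cutoff from older ones."""
    current, archived = [], []
    for r in records:
        if date.fromisoformat(r["created"]) >= cutoff:
            current.append(r)
        else:
            archived.append(r)
    return current, archived

def to_json_lines(records):
    """Serialize records as JSON Lines, an open, tool-neutral text format."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

# Hypothetical export from a SaaS application, invented for this example.
records = [
    {"id": 1, "created": "2019-03-01"},
    {"id": 2, "created": "2024-02-15"},
    {"id": 3, "created": "2021-12-31"},
]

current, archived = split_for_archive(records, cutoff=date(2022, 1, 1))
archive_payload = to_json_lines(archived)  # destined for low-cost storage
print(len(current), "current,", len(archived), "archived")
```

Keeping the archive in a plain, documented format means it can still be read after the originating application has been retired.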
6. Pick optimal management platforms for data types
One final point around data systems debt is the need for architects to debate the optimal database and data management platforms. Relational databases were the only viable options decades ago, but today, architects can select from graph, key-value, columnar, document, and other database technologies.
Pick a less-optimal data management platform, and the workarounds needed for data analysis can create data debt complexities.
One approach is to seek flexible data stores such as data lakes and the semistructured data models in graph databases. Victor Lee, vice president of developer experience at TigerGraph, says, “Graph technology helps to reduce data debt by enabling businesses to quickly connect their data in a loose way and then assist in integrating the data more intelligently.”
As organizations seek to be more data driven in decision-making and develop machine learning models for competitive advantages, data teams must address data debt proactively.