Ensuring Data Integrity — Before and After the Migration
Steve Winkler | November 9, 2016
One of the most crucial aspects of a successful data migration is ensuring data integrity at both ends of the process.
For data residing in the legacy system, far too many agencies underestimate the number of inaccuracies, discrepancies, and conflicts that exist. Some older database technologies are remarkably lenient about enforcing rules and constraints when data is added or modified. And in systems and applications that have been in production for years, bugs that were fixed long ago may still have allowed unexpected data to be recorded in the database.
Problematic data buried in a legacy system may be annoying, but it is not generally the cause of operational problems in and of itself. Much of the data is old, and is rarely if ever accessed by the existing system. In fact, many stakeholders may not even know it exists. However, once you try to migrate malformed data into a new database with strict policies about what data is allowed, the conflicts it creates can slow or shut down your migration process.
Bad data comes in many forms, and can occur when valid data appears in the wrong place (for example a date expected to be “01-01-1999” shows up as “19990101”) or when internally-represented values (such as numeric or floating point data) contain corrupted or unexpected byte sequences. For the migration to proceed smoothly, it’s essential to find these anomalies and figure out how to address them either before or during migration.
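As a sketch of what finding these anomalies can look like in practice (the format strings here are hypothetical examples of layouts a legacy system might contain), a migration script can try each known date layout in turn and flag any value that matches none of them:

```python
from datetime import datetime

# Candidate date layouts observed in the legacy data (hypothetical)
KNOWN_FORMATS = ["%m-%d-%Y", "%Y%m%d", "%Y-%m-%d"]

def normalize_date(raw: str):
    """Return an ISO-8601 date string, or None if no known layout matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # anomaly: route this value to the problem-data workflow

print(normalize_date("01-01-1999"))  # 1999-01-01
print(normalize_date("19990101"))    # 1999-01-01
print(normalize_date("not a date"))  # None
```

Values that normalize cleanly can proceed into the target system; values that come back as None are exactly the anomalies that need to be addressed before or during migration.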
Handling problem data
One thing to note: inevitably, certain legacy data (such as the date-format errors cited above) will not make it into the new target system without some intervention. In some cases, platform vendors have been known to advise clients that "bad" data that fails to load should simply be ignored, or deleted from the source system prior to migration. This may be the easy path, but few organizations are going to take it. For some government organizations, it's not an option at all. For example, agencies involved in law enforcement are usually subject to laws and regulations that require the retention of investigative data for decades, including the bad data. From a legal standpoint, you can't just pretend the bad data was never there.
We suggest a better alternative: assign valid but conspicuous values as “flags” for problem data in the new system. By doing so, when users encounter the flag in a certain record, they immediately know the migrated record contained information that could not be loaded. For each such instance, persist the original data from the legacy system in a generic format and link it to the new record. This way, the user can quickly check the original source to meaningfully interpret the data.
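A minimal sketch of this flag-and-preserve approach (the sentinel value, field names, and record layout are all invented for illustration): load a valid but conspicuous sentinel into the target field, and keep the original raw value in a generic record linked by the row's key.

```python
from datetime import datetime

# A value that is valid in the target schema but obviously not real data
FLAG_DATE = "1800-01-01"

def is_valid_date(raw: str) -> bool:
    """Check the value against the target system's required layout."""
    try:
        datetime.strptime(raw, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def prepare_row(raw_row: dict):
    """Return (row for the new system, preserved original or None)."""
    new_row = dict(raw_row)
    preserved = None
    if not is_valid_date(str(raw_row.get("case_opened", ""))):
        new_row["case_opened"] = FLAG_DATE
        # Persist the untouched legacy value, linked to the new record
        preserved = {"record_id": raw_row["id"],
                     "field": "case_opened",
                     "original_value": raw_row.get("case_opened")}
    return new_row, preserved

row, kept = prepare_row({"id": 42, "case_opened": "19990101"})
```

A user who sees "1800-01-01" in the new system knows at once that the field could not be loaded as-is, and the linked preserved record lets them inspect the original legacy value.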
Show me the proof
Stakeholders will need to be shown that their data is complete and correct once it has been migrated to the new platform — a level of confidence they cannot establish on their own. By verifying that the data appears accurately in its new databases and repositories, an agency can demonstrate to stakeholders that the post-migration information retains its full integrity.
When we work with clients, we often use frameworks that apply combinations of validation procedures and automated statistical and analytical tools. These frameworks allow an agency to examine large amounts of the migrated data, comparing and checking it against expected results. Statistical techniques provide a general level of assurance that the migrated data is complete. Where accuracy is a concern, checksum methods applied both before and after migration can efficiently confirm that critical data is correctly represented in the new system.
Automating the data quality review
On both the front and back end of the data migration process, there are mountains of data to test. Because of the data's volume and complexity, finding and resolving every anomaly is often easier said than done — and nearly impossible through manual intervention alone. This is why it's so critical to use automated tools, whether they come bundled with data profiling software or are developed in-house (which can be done at relatively low cost). In more challenging or complicated situations, you may need to purchase specialized tools.
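Even a simple home-grown profiling tool can surface anomalies at scale. The sketch below (column contents are hypothetical) collapses every value in a column to a character-class "shape"; rare shapes in the resulting frequency table are often exactly the records worth reviewing:

```python
import re
from collections import Counter

def profile_column(values):
    """Summarize a column: null count plus frequency of value 'shapes'.

    Each value is reduced to a pattern (digits -> 9, letters -> A), so
    a lone outlier layout stands out against the dominant pattern.
    """
    patterns = Counter()
    nulls = 0
    for v in values:
        if v is None or v == "":
            nulls += 1
            continue
        shape = re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v)))
        patterns[shape] += 1
    return {"nulls": nulls, "patterns": patterns}

report = profile_column(["01-01-1999", "02-15-2001", "19990101", None])
# Here "99-99-9999" dominates, and the lone "99999999" shape flags
# the compacted date as a candidate anomaly.
```

Running a profile like this on every column, on both the source and target sides, turns the review from record-by-record inspection into scanning a short list of outlier patterns.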
By effectively testing data integrity and resolving anomalies before data migration, and also testing the validity of data following migration, an organization can ensure that it retains all the information from its legacy system. Just as importantly, it can actually resolve certain longstanding inaccuracies, conflicts, and other lingering data quality issues. As a result, stakeholders and data users can be assured that the data they’re seeing today is not only accurate, but even improved from what it was in the legacy system.