Data indexing and discovery

Data indexing and discovery is about:

  • quickly and efficiently finding data across an agency
  • understanding the current data landscape of your agency
  • making best use of existing data.

Efficient data indexing and discovery involves scanning repositories, databases and file stores for datasets. These tasks can be combined with other activities to help you understand where existing data resides and which file naming conventions are in use.
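
As an illustration only, scanning a file store could be sketched as below. This is a minimal sketch, not a recommended tool: the function names (index_file_store, summarise_by_extension) and the list of file extensions treated as datasets are assumptions made for this example, and an agency would substitute its own formats and storage locations.

```python
import os
from collections import Counter
from pathlib import Path

# Assumption for this sketch: files with these extensions count as datasets.
DATASET_EXTENSIONS = {".csv", ".json", ".xlsx", ".parquet"}


def index_file_store(root):
    """Walk a directory tree and return one index record per dataset file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            extension = path.suffix.lower()
            if extension in DATASET_EXTENSIONS:
                records.append({
                    "path": str(path),          # where the data resides
                    "name": name,               # shows the naming convention in use
                    "extension": extension,
                    "size_bytes": path.stat().st_size,
                })
    return records


def summarise_by_extension(records):
    """Count indexed files per extension to give a quick view of the landscape."""
    return Counter(record["extension"] for record in records)
```

Calling index_file_store on a shared drive path returns a list of records that could be reviewed by hand or loaded into an existing register. The "name" field is kept separately so that naming conventions can be compared across folders.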

Data indexing and discovery can be used to:

  • speed up a data holdings audit or information review
  • create and update a catalogue of datasets such as an information asset register.

It will assist you to:

  • learn how data flows through an agency
  • document the data lineage of datasets
  • identify opportunities for improvements.

Data lineage documentation can include information on the business rules that are applied to the data as well as how the data flows or is transformed to be in its current state.

Through the use of data indexing and discovery tools, agencies can prioritise the maintenance and remediation of high value data.

Data profiling

Data discovery also includes profiling the data and assessing its quality. Data profiling, visualisation, and search and query tools can be used on large datasets to streamline assessments and provide insight into the root causes of issues with your data. These processes help you improve a system's architecture and the agency procedures that detail how you procure, store, manage, use and dispose of data (DAMA, 2017).

Key data profiling tasks include:

  • identifying differences between your data and what you assume it to be
  • researching your current data flow and identifying areas for improvement
  • profiling both source and target data, which helps you determine the data transformation required to meet the standards of a specific initiative.
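
The first task above, comparing your data with what you assume it to be, can be sketched with a simple column profile. The example below is a minimal sketch under stated assumptions: profile_rows is a hypothetical helper written for this illustration, the sample data is invented, and real profiling tools report far more than row, missing and distinct counts.

```python
import csv
import io


def profile_rows(rows):
    """Return per-column counts of rows, missing values and distinct values."""
    profile = {}
    for row in rows:
        for column, value in row.items():
            if column is None:
                # csv.DictReader stores surplus, unlabelled fields under None.
                continue
            stats = profile.setdefault(
                column, {"rows": 0, "missing": 0, "values": set()}
            )
            stats["rows"] += 1
            if value is None or value.strip() == "":
                stats["missing"] += 1
            else:
                stats["values"].add(value.strip())
    return {
        column: {
            "rows": stats["rows"],
            "missing": stats["missing"],
            "distinct": len(stats["values"]),
        }
        for column, stats in profile.items()
    }


# Invented sample data for illustration only.
sample = io.StringIO("id,state,postcode\n1,ACT,2600\n2,,2601\n3,act,\n")
for column, stats in profile_rows(csv.DictReader(sample)).items():
    print(column, stats)
```

In this sample the state column contains both "ACT" and "act", which the profile counts as two distinct values. A result like that is one way a profile can reveal a gap between the data and the assumption that a column uses a single consistent code list.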

Catalogue your agency's data

Knowing what information and data you have in your agency is key to good governance. You should check your relevant governance framework for guidance on where to find the most up-to-date intelligence about your holdings.

You may have:

  • an information asset register that identifies your assets, their potential value and possible risks
  • a data catalogue that documents your agency's datasets and may include their systems, sources and locations
  • a business systems register or software licence register that may include details about systems' data – along with any associated system information management plans
  • a metadata repository that stores your agency’s metadata
  • a risk register or system security plan that may also document data holdings.

Automation tools can index your agency’s databases and feed the results into existing registers or catalogues, which streamlines the process of cataloguing datasets.
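
To illustrate how an automated index of a database might feed a catalogue, the sketch below lists the tables, columns and row counts of a SQLite database using Python's standard sqlite3 module. It is a sketch only: SQLite stands in for whatever database platform your agency uses, and the function name index_database and the fields of each catalogue entry are assumptions made for this example rather than a prescribed register format.

```python
import sqlite3


def index_database(connection, system_name):
    """Return one catalogue entry per user table in a SQLite database."""
    entries = []
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        if table.startswith("sqlite_"):
            continue  # skip SQLite's internal bookkeeping tables
        # Quote the identifier so unusual table names do not break the queries.
        quoted = '"' + table.replace('"', '""') + '"'
        columns = [
            row[1]  # second field of each table_info row is the column name
            for row in connection.execute(f"PRAGMA table_info({quoted})")
        ]
        row_count = connection.execute(
            f"SELECT COUNT(*) FROM {quoted}"
        ).fetchone()[0]
        entries.append({
            "system": system_name,
            "dataset": table,
            "columns": columns,
            "row_count": row_count,
        })
    return entries
```

Each returned entry could then be appended to, or reconciled against, an existing data catalogue or information asset register. Running the index on a schedule is one way to keep such a register current.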

Search tools

Other tools that allow users to search for data include:

Copyright National Archives of Australia 2019