D Step: Document for Curation

Learning Outcomes


Curators will be able to:
  1. identify main goals of having a documentation plan for one’s archive
  2. review typical features of a documentation and digital archiving workflow plan, with a nod to digital preservation principles, illustrated by examples from other DCN institutions.
  3. create a documentation report for their institution, on a “features” form.



Terms to know



Accessioning record: The record of deposit made upon addition to the repository collection, with relevant cataloging details such as dates and depositor names.

Repository record metadata: The descriptive metadata associated with the deposited files and dataset, covering accession, findability, contributors, and content overviews.

Provenance: In the context of digital preservation, the history of a digital object, including its origins (e.g. depositor name) and any transformations applied to that original object before release.

Packaging: In the context of digital preservation, the inclusion of all files related to dataset preservation, including documentation, in a single bundled format such as .TAR or .ZIP, often including checksum metadata for fixity checks, such as MD5 files.

Checksum: A unique digital code, with a separate reference copy, applied to and preserved with a given digital object or bundle. The code is checked periodically against the reference copy; any change in bits indicates corruption of the preserved original files, signaling the need to replace them with a backup.

Terms of use: Generally a set of metadata fields covering users’ obligations when downloading files, including licensing (such as Creative Commons), citation requirements, confidentiality declarations, and other conditions.



Summary of the Document Step

Throughout the deposit and curation process we are recording the significant treatments or actions applied to the submission. This is distinct from documentation the depositors created to accompany their own datasets. Archives and repositories should document records of the collection’s accession, progress as a project, changes made by curators from the original data submission, and preservation documentation.

 D Step Actions


What is curation documentation?

The goal for dataset documentation is to support the long-term preservation of the data collection by recording curation activities and treatments. At minimum this involves:
   1. Accessioning and deposit records
   2. Repository dataset record metadata
   3. Provenance/change logs
   4. Service workflow
   5. Preservation packaging metadata


Such documentation can be bundled with a preservation copy of the dataset, for example when using packaging software such as BagIt. This module illustrates these types of documentation for dataset curation.

Accessioning and deposit records

Every data repository service needs minimal tracking of deposited data collections. The traditional archival term “accessioning” can refer to the receiving and processing of digital deposits with accompanying recordkeeping. Most repository platforms like Dataverse or DSpace maintain some accessioning information such as deposit dates, but they do not necessarily maintain everything your repository service may wish to track, nor provide summary reports for internal administration.



[Dataverse minimal accessioning information]


Archivematica, for example, is a general digital archiving platform that supports fairly extensive workflow and accessioning information to track the deposit process; however, this platform is currently not readily adaptable to user-facing research data repositories.

[Define image 1] Archivematica accessioning record view



Repository services may have to set up external systems of tracking on databases or spreadsheets, usually with some manual data entry. Johns Hopkins University Data Services, for example, maintains a spreadsheet, with unique accession numbers and “package name” for datasets, as well as dates and contact information.

[Define image 2] Accessioning spreadsheet



Other ticket-based project management tools such as JIRA can be used to capture some of the accessioning metadata while also tracking steps in the curation workflow. Examples of JIRA will be shown in the section on Service Workflow Documentation.

Checklist of metadata for accessioning and deposit records

Deposit ID: A unique identifier for the collection deposit (usually different from the DOI).
Preservation filename: If the repository creates separate preservation copies or bundles, their filenames are often generated following a naming convention.
Collection title: Supplied by the depositor, sometimes with modifications suggested by the curator.
DOI: Digital Object Identifier registered with DataCite, unique to the citable dataset and sometimes generated for individual files.
Depositor: Usually the contributing creator of the dataset. Additional fields might designate a person responsible for the dataset over time (such as a project PI) when a separate depositor (such as a student) serves as the corresponding contact.
Affiliation: Institution and often a department or research group.
Date deposited: Date officially received by the repository.
Date distributed: Generally the date the dataset became publicly available.
Date preserved: May be a separate, later date when the dataset is prepared for preservation.
Curator: One or more repository staff who managed the deposit and any file transformations and/or correspondence with the depositor.
IRB documentation: For datasets from human subjects research, fields indicating whether depositors provided a sample consent form or other IRB approval, and whether privacy disclosure screening was conducted.
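
For repositories that track accessions outside the platform, the checklist above can be sketched as a simple record structure exported to a spreadsheet-friendly CSV file. This is a minimal illustration only: the class name, field names, and sample values are hypothetical and do not reflect any particular institution's spreadsheet.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class AccessionRecord:
    """One row of a hypothetical accessioning spreadsheet."""
    deposit_id: str
    collection_title: str
    depositor: str
    affiliation: str
    date_deposited: str                      # ISO 8601 date string
    curator: str
    doi: Optional[str] = None                # may not exist until publication
    preservation_filename: Optional[str] = None
    date_distributed: Optional[str] = None
    date_preserved: Optional[str] = None
    irb_documentation: Optional[str] = None


def records_to_csv(records):
    """Serialize accession records to CSV text with a header row."""
    buffer = io.StringIO()
    column_names = [f.name for f in fields(AccessionRecord)]
    writer = csv.DictWriter(buffer, fieldnames=column_names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))      # None is written as an empty cell
    return buffer.getvalue()


# Placeholder values for illustration only.
example = AccessionRecord(
    deposit_id="ACC-0001",
    collection_title="Example stream temperature dataset",
    depositor="A. Student",
    affiliation="Example University, Department of Biology",
    date_deposited="2021-03-15",
    curator="J. Curator",
)
csv_text = records_to_csv([example])
print(csv_text)
```

Fields that are not yet known, such as the DOI before publication, are left empty and can be filled in as the deposit moves through curation.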

Repository dataset record metadata

Much of the crucial metadata about a deposited dataset is captured in the repository collection record itself. Beyond the accessioning information (deposit date) this includes the collection title, citation, DOI, descriptions, authors and other details. For most platforms, this information cataloging the deposit record is visible online to any visitor to the site. Additional metadata may only be visible to repository administrators or staff. Is it a best practice to keep all of that collection metadata locked within that single repository platform? In addition to the platform’s own backup, it may be prudent to export that record metadata for internal administration as well as a reference for users. For administering the repository, exporting and preserving record metadata as a separate file can serve as a backup independent of the platform, and can also be packaged along with preservation copies of the data files themselves. Dataverse, for example, allows export of metadata in several formats, such as DDI, which could potentially be used to reconstitute all deposit records on another platform. [Image - Repository metadata export, Dataverse]



Cornell builds a curation log for internal use partly from platform-supplied information but manually adds additional curation details, and adds the file to the download collection. This approach blends the user-facing information and administrative documentation for preservation. [image alt-text ] Cornell README that includes record metadata.



For repository visitors, the files they download may not necessarily include a “README” document that includes that record information, especially the crucial citation, description, authors, and terms of use. Users would not have those details as a “take away” kept with the downloaded files if they never return to the original repository collection. Depending on the repository’s deposit methods, curators may rely on researchers to supply a README. If it is feasible to supply that record summary to the depositor as a README starter, the depositor may be encouraged to add additional details about the datasets and their use. Alternatively, curators might also add such details to a README, and create one when not supplied by the depositor. Some platforms generate a deposit record summary automatically to be included with all download files. U. Michigan’s Deep Blue data repository generates README-style record documentation automatically included with the downloaded files. [Alt-text: U. Michigan record documentation accompanying dataset downloads.]



Checklist for documenting repository record metadata

If a README metadata record is generated for repository users to download with data files, essential “take away” information could include:
  • Data citation (Title, authors, date, publishing repository, DOI link)
  • DOI or link to deposit
  • Depositor contact information
  • Date published
  • Description of dataset (or associated publication abstract)
  • Associated publication citation and link
  • Rights license and terms of use
  • File list with additional description
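
Where a platform does not generate such a file automatically, a README along these lines could be assembled from exported record metadata. The sketch below assumes the metadata is already available as a Python dictionary; the function name, dictionary keys, and all sample values (including the DOI) are placeholders and not the output format of any specific repository.

```python
def build_readme(record):
    """Render a plain-text README from a dictionary of record metadata."""
    lines = [
        "README for: " + record["title"],
        "",
        "Citation: " + record["citation"],
        "DOI: " + record["doi"],
        "Contact: " + record["contact"],
        "Date published: " + record["date_published"],
        "",
        "Description:",
        record["description"],
        "",
        "Associated publication: " + record.get("publication", "none listed"),
        "License and terms of use: " + record["license"],
        "",
        "Files:",
    ]
    for name, description in record["files"]:
        lines.append("  " + name + ": " + description)
    return "\n".join(lines) + "\n"


# Placeholder metadata for illustration only.
example_record = {
    "title": "Example stream temperature dataset",
    "citation": "Student, A. (2021). Example stream temperature dataset. Example Repository.",
    "doi": "10.0000/example",
    "contact": "depositor@example.edu",
    "date_published": "2021-04-01",
    "description": "Hourly stream temperature readings from three sites.",
    "license": "CC BY 4.0",
    "files": [
        ("observations.csv", "Hourly temperature readings"),
        ("sites.csv", "Site names and coordinates"),
    ],
}
readme_text = build_readme(example_record)
print(readme_text)
```

The resulting text could be saved as README.txt alongside the data files, or handed to the depositor as a starter to extend.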

Provenance/change logs

Provenance is another term adapted from traditional archiving. It refers to documenting the history of a digital object, including its origins (e.g. depositor name), and logging any transformations applied to that original object before release. Subsequent version changes, as well as ownership or location changes (e.g. moving to another repository), should also be documented with the data collection provenance. Such change logs provide another layer of context to the preservation package for a collection, with version tracking being especially important for collection management. Again, few repository platforms automatically capture all aspects of provenance, so repositories may have to devise external workflows for documentation, either automated or manual. Cornell’s repository system includes fields for recording each change during the curation process, which can be included in the record metadata README document that accompanies the data file downloads. [Image: Cornell provenance change logs.]



Johns Hopkins creates a simple table in MS Word that is included with the preservation copy of the data files.



Checklist for provenance log

Essential or minimal elements to capture in a provenance log:
  • Deposit identifier/record name
  • Change date
  • Curator
  • Action taken (description)
  • Versioning naming convention (post publication)
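
A manual change log with these elements can be kept in a table or spreadsheet, but it could also be recorded in a structured, machine-readable form. The following is a minimal sketch under assumed conventions: the function names, the field names, and the choice of one JSON object per line are illustrative and not drawn from any repository named in this module.

```python
import json
from datetime import date


def log_change(log, deposit_id, curator, action, version=None, change_date=None):
    """Append one provenance entry to an in-memory log and return it."""
    entry = {
        "deposit_id": deposit_id,
        "change_date": (change_date or date.today()).isoformat(),
        "curator": curator,
        "action": action,
        "version": version,      # e.g. "v2" for post-publication updates
    }
    log.append(entry)
    return entry


def to_json_lines(log):
    """Serialize the log as one JSON object per line for preservation."""
    return "\n".join(json.dumps(entry, sort_keys=True) for entry in log)


# Placeholder entries for illustration only.
provenance_log = []
log_change(provenance_log, "ACC-0001", "J. Curator",
           "Converted observations.xlsx to observations.csv",
           change_date=date(2021, 3, 16))
log_change(provenance_log, "ACC-0001", "J. Curator",
           "Depositor supplied corrected site coordinates",
           version="v2", change_date=date(2021, 6, 2))
print(to_json_lines(provenance_log))
```

A plain-text serialization like this can be stored with the preservation copy of the data files and remains readable without special software.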

Service workflow

The service workflow refers more generally to the overall handling of a dataset as a project or task, from initial inquiries from a depositor and subsequent email exchanges, to the deposit and curation steps, to publication and maintenance. Some repository institutions take a “project management” approach, tracking deposits as “received,” “in process,” and “completed” with dates, curators assigned, and other tracking details. Workflow documentation is mainly useful for repository management, but it may be relevant to capture some workflow elements for the dataset deposit record and provenance for preservation. Some digital archiving platforms such as Archivematica have built-in service workflow tracking, but repositories may often need to adapt general-purpose project management software and build customized workflows. Request tracking and ticketing systems such as JIRA, OTRS, or Request Tracker could be adapted from their typical IT helpdesk or code development functions. For example, the Data Curation Network uses JIRA to track curation help requests.



Tickets or “Issues” for JIRA can be customized with fields relevant to data deposits, such as U. Michigan’s:



U. Illinois’ system also captures email exchanges:



Service workflow ticketing systems may not be sufficient in themselves for tracking the deposit and accessioning records of completed projects. A “distributed” approach might use JIRA in combination with other records in multiple locations, such as spreadsheets and archived Outlook email exchanges. JHU, for example, is supplementing their spreadsheet system, mentioned earlier, with JIRA tracking for in-process deposits. [JHU image]

Checklist for service workflow

Service workflow focuses on tracking the stage in the process from accession to publication/preservation. A given deposit’s “ticket” might include all or part of the deposit record metadata and accessioning information but should have relevant “at-a-glance” information which could include:
  • Deposit ID/filename
  • Depositor
  • Status (e.g., Intake, curating, publishing, closed)
  • Curator (assigned to)
  • Relevant activity dates (deposit, curate, publish)
  • Last activity/modification date
  • Notes
  • Work time log (optional)
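
The “at-a-glance” elements above can be sketched as a small ticket structure with a status that moves through the workflow stages. This is an illustration only: it assumes a strictly linear sequence of stages (intake, curating, publishing, closed), which real ticketing systems such as JIRA do not require, and all class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional


class Status(Enum):
    INTAKE = "intake"
    CURATING = "curating"
    PUBLISHING = "publishing"
    CLOSED = "closed"


# Assumed linear workflow; a real service may allow other transitions.
NEXT_STATUS = {
    Status.INTAKE: Status.CURATING,
    Status.CURATING: Status.PUBLISHING,
    Status.PUBLISHING: Status.CLOSED,
}


@dataclass
class Ticket:
    deposit_id: str
    depositor: str
    curator: str
    status: Status = Status.INTAKE
    activity_dates: dict = field(default_factory=dict)
    last_activity: Optional[str] = None
    notes: list = field(default_factory=list)

    def advance(self, on):
        """Move the ticket to the next stage and record the date."""
        if self.status is Status.CLOSED:
            raise ValueError("ticket is already closed")
        self.status = NEXT_STATUS[self.status]
        self.activity_dates[self.status.value] = on.isoformat()
        self.last_activity = on.isoformat()


# Placeholder ticket for illustration only.
ticket = Ticket(deposit_id="ACC-0001", depositor="A. Student", curator="J. Curator")
ticket.notes.append("Depositor asked about embargo options.")
ticket.advance(date(2021, 3, 16))
print(ticket.status.value, ticket.activity_dates)
```

Recording the date of each stage change gives the “relevant activity dates” and “last activity” entries without separate manual data entry.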

Preservation packaging metadata

A final step of documentation can occur at the preservation stage of dataset archiving. Data files can be bundled with all relevant documentation to form a “package” ready for long-term storage, ideally one capable of restoring the full data collection. Certain digital archiving software produces self-contained packages bundling files, descriptive metadata, and preservation metadata such as MD5 checksum files for fixity checks. BagIt is a leading digital archiving protocol that, in addition to preservation metadata, allows additional descriptive metadata files that could include accessioning and deposit records. The Data Conservancy Package Tool developed by Johns Hopkins, for example, uses the BagIt protocol to generate a “bag-info” file packaged with data into a single TAR file for their preservation system.




Other programs like Archivematica leverage this ability to add preservation metadata to the BagIt package. Even without such systems, it is a best practice to keep as much documentation as possible together with the preserved files, including provenance and preservation metadata, to create a more complete backup of the archive and curation process.

Checklist for preservation packaging

Much of the deposit metadata and accessioning documentation discussed so far could be preserved with the files as part of the package. Documentation that might be included as part of the preservation package could include:
  • Accessioning records
  • Deposit metadata
  • Provenance records
  • Depositor correspondence/emails (optional)
Special preservation metadata might include:
  • BagIt info files
  • MD5 checksum manifest files for fixity checks
  • Special record IDs and/or file naming for preservation packages (not used elsewhere)
  • Versioning information for updated files.
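
The fixity check mentioned above can be illustrated with a short script that writes an MD5 manifest for a package directory and later verifies the files against it. This is a sketch using only the Python standard library, not a full BagIt implementation: it borrows the BagIt layout of a “data” subdirectory and a “manifest-md5.txt” file but omits the other files a valid bag requires, and the function names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path, chunk_size=65536):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(package_dir):
    """Write manifest-md5.txt listing a checksum for every file under data/."""
    package_dir = Path(package_dir)
    lines = []
    for path in sorted((package_dir / "data").rglob("*")):
        if path.is_file():
            relative = path.relative_to(package_dir).as_posix()
            lines.append(md5_of(path) + "  " + relative)
    manifest = package_dir / "manifest-md5.txt"
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(package_dir):
    """Return the listed paths that are missing or no longer match."""
    package_dir = Path(package_dir)
    manifest = package_dir / "manifest-md5.txt"
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, relative = line.split(None, 1)
        target = package_dir / relative
        if not target.is_file() or md5_of(target) != expected:
            failures.append(relative)
    return failures


# Demonstration in a temporary directory with a placeholder data file.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "observations.csv").write_text("site,temp\nA,12.5\n", encoding="utf-8")
    write_manifest(tmp)
    print("failures after packaging:", verify_manifest(tmp))
    (data_dir / "observations.csv").write_text("site,temp\nA,99.9\n", encoding="utf-8")
    print("failures after alteration:", verify_manifest(tmp))
```

Running the verification periodically against the stored manifest flags any file whose contents have changed, which is the signal to restore that file from a backup copy.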


Key Ethical Considerations

  • Document that disclosure risk review has taken place. State if changes from original data have been made, but do not give enough detail on changes to reverse-engineer any anonymization.
  • Include consent (or waiver) and/or IRB approval of sharing with administrative documentation.
  • Consider collecting contributor demographics.

Activity

This module covered several aspects of documentation recommended for administering an institutional data repository.

Activity Materials

Directions

This exercise is about articulating your current local institutional repository documentation strategy, or your desired approach if you do not currently have a data archive or curation program. Using the checklist below, draft answers describing how you currently document your curation, preservation, and data-related activities, or how you would like to document them in the future.

Activity Checklist


Documentation type: Accessioning and deposit records
Description: Accessioning here refers to documentation of data deposits and processing. Minimal information may include deposit IDs or filenames following a naming convention, relevant dates, depositor names and curator names.
Questions to address: Other deposit record metadata might be captured, but of relevance here is your method of assembling and viewing the lists of deposits’ essential information, their history and status. Are such tracking and reporting features in your repository platform (or ideal system) or via spreadsheets or other means?

Documentation type: Repository dataset cataloging metadata
Description: “Cataloging” metadata refers here to the information about each dataset record, including metadata visible to viewers of the repository and internal administrative metadata.
Questions to address: The accessioning records may track some of this metadata, but does your repository export and preserve all or part of the catalog record outside of the platform itself? (Dataverse, for example, exports metadata in several formats.) Do you have, or plan, a version of the catalog record for users that exports along with the data as a README?

Documentation type: Provenance/change logs
Description: The provenance of a digital object can refer to its record of changes from original deposit to transformations during curation before release. Subsequent version changes and location changes can also be relevant to the collection’s long-term preservation, in addition to accessioning metadata.
Questions to address: How does (or would) your repository capture the history of transformations during curation and/or updates after publication?

Documentation type: Service workflow
Description: Archive deposits follow a workflow from requests, deposits, and curation to publication that can be tracked like a project management task. All repositories document this workflow in some form. Sometimes this is done manually, such as through spreadsheets or calendars. Some repository platforms track at least part of a workflow, such as deposit dates and release status. Project management or request ticketing systems such as JIRA can also be adapted to much of the workflow documentation.
Questions to address: How does your repository track workflow? What software do you use, or would explore using?

Documentation type: Preservation packaging metadata
Questions to address:
What is your repository’s method, or planned method, for preserving archived files and associated metadata?
Do you/will you have a separate preservation copy of files from user-facing files in the platform?
Do you/will you package files and relevant documentation together using protocols such as BagIt? Describe, and also mention the preservation format (e.g., .TAR or .ZIP).
Is the correspondence/email with depositors preserved, and how?
Does your preservation package include special preservation metadata such as MD5 records for fixity checks, special preservation IDs or filename conventions, or other descriptive metadata besides the catalog and accessioning metadata?

Review of the Document Step (D Step)

This module discussed several aspects of documentation, beyond what depositors provide, that help with the administration and workflow of an institutional data repository. Traditional archiving and project management supply some relevant terminology for what documentation to collect and preserve, including:
  • Accessioning records for the receiving and processing of deposits.
  • Deposit record metadata exported from the platform.
  • Provenance of all changes from the original dataset and post-publication updates.
  • Workflow tracking of ongoing deposits, and optionally, correspondence, with a ticketing system or a similar project management approach.
  • A preservation package of files and documentation including metadata specific to preservation, such as MD5 fixity checksum data.

The exercise is an opportunity to record, develop, and potentially share documentation plans for your institutional repository. These five types of documentation could ideally be sufficient to reproduce the archive apart from its platform, as a best practice for long-term digital preservation.


Additional Resources

Resources referenced in this guide and related to dataset documentation: