U Step: Understand the Data Deposit

Learning Outcomes


Curators will be able to:
   1. Identify different terms and processes associated with the Understand step of the CURATED model.
   2. Assess a dataset and its component files as a complete package.
   3. Engage in an activity to practice the Understand step on the dataset identified in the previous Check step.
   4. Reflect on what might be necessary to enhance understanding of the data package in preparation for the Request step: Request missing information.



Terms to know




The path refers to the location of a file in the directory structure of where it is stored.

     a. Absolute paths provide a full list of all of the folders from the beginning (or “root”) of the storage unit. On most Unix-based systems, the root directory is “/”. On Windows systems, the root directory usually begins with a drive letter such as “C:\”.

     b. Relative paths provide a list of folders that begins at a designated folder (usually the initial folder of a project). When curating for secondary use, relative paths are preferred for long-term preservation, since it is generally easier to share and preserve the initial project folder and all subsequent folders.
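
As an illustration of the difference, the following minimal Python sketch uses the standard library's pathlib module to express an absolute path relative to a project folder. The folder and file names are hypothetical and chosen only for this example.

```python
from pathlib import PurePosixPath

# Hypothetical project folder and data file, used only for illustration.
project_root = PurePosixPath("/home/researcher/project")
absolute_path = PurePosixPath("/home/researcher/project/data/survey.csv")

# An absolute path starts at the root of the storage unit.
print(absolute_path.is_absolute())   # True

# The same file, expressed relative to the project folder.
relative_path = absolute_path.relative_to(project_root)
print(relative_path)                 # data/survey.csv
print(relative_path.is_absolute())   # False
```

The relative form stays valid wherever the project folder is copied, which is why it tends to travel better with a shared data package.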


A codebook provides a description of all the items contained in the data collection with information about individual files, measures, and codes used to represent those files. A codebook will often provide additional context about the data collection process, assumptions, requirements, and descriptive statistics that enable a secondary user to understand the context for the collection while also validating the integrity of the data collection.


Commented code contains documentation embedded in the computer programming code that is ignored by the interpreter or compiler when the program executes.


A data dictionary is a list of the elements contained in a dataset and their position in the data file. Each file in a data submission may have its own data dictionary.


A delimited file is a data file in which each data element is separated by a common character. Comma-separated values (.csv) files are very popular, but tab-separated values (.tsv) files and pipe-delimited (|) files are also used in many data projects.
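
When the separator character is not obvious from the file extension, a quick check of the header line can help. The sketch below is one simple approach, not a robust detector: it assumes the first line is a header and that the separator is a comma, tab, or pipe. The function names and the sample data are hypothetical.

```python
import csv
import io

# Candidate separators considered by this sketch (an assumption).
CANDIDATE_DELIMITERS = [",", "\t", "|"]

def guess_delimiter(first_line):
    """Return the candidate character that appears most often in the line."""
    return max(CANDIDATE_DELIMITERS, key=first_line.count)

def read_delimited(text):
    """Parse non-empty delimited text into a list of rows."""
    first_line = text.splitlines()[0]
    delimiter = guess_delimiter(first_line)
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

# Hypothetical pipe-delimited sample.
sample = "id|species|count\n1|heron|4\n2|egret|7\n"
rows = read_delimited(sample)
print(rows[0])  # ['id', 'species', 'count']
```

If the header row splits into the expected column names, the guess is probably right; if it comes back as a single long string, another separator is likely in use.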


A dependency exists when software code requires the presence of certain files (file dependency) or software libraries for the program to execute. Some dependencies may require a particular version of a software library for execution.
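
For deposits that include Python scripts, one way to get a first look at library dependencies is to list the modules a script imports without running it. The following is a minimal sketch using the standard library's ast module. The function name and the example script are hypothetical, and the approach does not reveal which versions are required.

```python
import ast

def list_imported_modules(source_code):
    """Return the sorted top-level module names imported by Python source code."""
    modules = set()
    for node in ast.walk(ast.parse(source_code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports of the project's own files.
            modules.add(node.module.split(".")[0])
    return sorted(modules)

# Hypothetical script contents; the code is parsed, never executed.
example_script = """
import numpy as np
from scipy.stats import ttest_ind
import os.path
"""
print(list_imported_modules(example_script))  # ['numpy', 'os', 'scipy']
```

The resulting list can be compared against the deposit's documentation to see whether required libraries, and ideally their versions, are stated anywhere.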



Summary of the U Step: Understand the Data Deposit

After Checking the contents of the data deposit, the Understand step of the curation process requires a deeper dive into the individual items submitted for curation. The curator should review whether these items form a cohesive package that would allow someone other than the original researcher to be able to understand what is being presented. Perhaps the most important information to look for during this process is the presence of contextual and content-specific documentation, file dependencies, and potential ethical issues that could prohibit publishing openly in a repository.



U Step Actions

Below are some actions you might perform during the Understand step. Actions will vary depending on the subject matter and data type you are curating. The Data Curation Network’s Data Curation Primers provide a helpful resource for better understanding new data types and formats. Primer topics cover a wide range of formats, software platforms, and data types.


What to check for, and questions to ask:

Overall Quality Assurance
  • Are there missing data?
  • Could a user with similar qualifications to the author’s understand and reuse these data?
  • Does the provided code execute without errors?
  • Are the data, documentation, and/or metadata presented in a way that aids in interpretation?
Documentation Depth
  • Is the context of the data explained? (Methodology information, relevant citations, file relationships, etc.)
  • Is the content of the data explained? (variable and value labels, units of measurement, etc.)
  • Is there additional documentation that may be helpful based on the data type (e.g., a codebook, data dictionary, study protocol, or survey questionnaire)?
  • If code is present, is it commented (e.g., with notes explaining what each chunk of code is supposed to produce)?
Data Structure
  • If spreadsheet data are present, are there multiple sheets? Each sheet should be documented and stored as a separate delimited file (to ensure usability of each sheet rather than relying on the original software platform, which could become obsolete).
  • Are there embedded calculations or other software dependencies?
  • Are there references or links to other files in the package? If so, are all the files present and correctly referenced?
  • If code is present, are referenced file paths relative (easy to change based on where the data is located once downloaded)?
Ethical Issues Assessment
  • If human subject data are included:
    • Is a consent form present that allows for data sharing?
    • Are there any direct or indirect identifiers that could reveal the identities of those involved in the data project? If unsure, this should be a question for the data provider in the Request step.
  • If there are geographic data included:
    • Are there direct location identifiers (addresses, geographic coordinates, place names, etc.) that could pose a risk to research participants, endangered species, sensitive archaeological sites, etc.?
  • If data were obtained from another source, is permission listed to redistribute/republish?
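
As one way to approach the "Are there missing data?" question above for a delimited file, the sketch below counts empty or placeholder cells per column. It is a rough first pass under stated assumptions: the set of missing-value markers, the function name, and the sample data are all hypothetical, and a deposit's own documentation may define different codes.

```python
import csv
import io

# Assumed conventions for missing values; adjust to the deposit's codebook.
MISSING_MARKERS = {"", "NA", "N/A", "NULL"}

def count_missing(text, delimiter=","):
    """Count cells per column that are empty or match a missing-value marker."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row[name]
            # A value of None means the row had fewer fields than the header.
            if value is None or value.strip() in MISSING_MARKERS:
                counts[name] += 1
    return counts

# Hypothetical sample with two missing ages and one missing height.
sample = "id,age,height_cm\n1,34,170\n2,,165\n3,NA,\n"
print(count_missing(sample))  # {'id': 0, 'age': 2, 'height_cm': 1}
```

Columns with unexpected counts are good candidates for a question to the data provider in the Request step.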


U Step Checklist


Key Ethical Considerations

  • If working with human data, is this research done with and not on communities and populations involved? (You may wish to review data sources, researchers, and their connections to the communities and subjects they are serving to facilitate further conversation with researcher(s).)
    • Are there authoritative group representatives who should be contacted in the next (request) step?
  • Are there labels or other descriptive indicators that could be applied to better represent or protect an identified group of people impacted by this data deposit? (Example: TK labels)


Activity


Activity option A:

Use the example data deposit. These files are in Microsoft Excel and Word format.

Activity option B:

Download a dataset from Figshare that aligns with your own subject area expertise. Look over the dataset and determine if you, or someone with skills similar to the researcher, would have enough information to understand what is presented.

Materials Needed (both A and B):

Software available to open your chosen dataset and accompanying files (or ability to convert to a readable format)

Directions (both A and B):

   1. Open the files and assess what you see using the “U Step Actions” described above.

Active Reflection:

  1. How “understandable” do you think this dataset is?
  2. What would make this dataset more understandable? The answers you provide here will be helpful for the next step in the CURATED process.
  3. How do you see yourself using what you learned in your own practice?


Additional Resources

Types of Documentation:

README file

Codebook

Commented Code

Working With Various File Formats:

When curating data for a repository that accepts all types of data, you can receive many different types of files. As a result, you might not always have the requisite software needed to open and view the files. When this occurs, there are a few different ways to still be able to read the files using common, readily available software.

Common proprietary formats you might encounter include (but are not limited to) MATLAB (.mat, .m), Stata (.dta, .dct, .do), SAS (.sas, .sas7bdat), SPSS (.sav, .sps), ESRI/ArcGIS (.shp, .dbf, .gdb).

For some proprietary formats, there are open source, freely available software packages that can work with them. For example, QGIS can be used to work with files created in ESRI’s ArcGIS platform. For others, you may have to convert the files. A useful tool for conversion is Stat/Transfer. It is not freely available, but can be worth the investment given that it also helps with older legacy file formats.

Notepad++ is a free source code and text editor. It is an exceptionally helpful tool when working with text files that appear unstructured when opened with regular Notepad or WordPad. It can also often be used to open code files such as .m, .r, .do, and .sas. Notepad++ is also worth trying when a file appears not to have any extension.

Curating human participant data can be challenging. The Data Curation Network has a Primer on Human Participants Data Essentials that can help inform that process.