T Step: Transform File Formats


Learning Outcomes

Curators will be able to:

  1. Describe what transformation is, and why it is important.
  2. Identify common file formats to maximize long-term access and preservation.
  3. Use different tools available to help with file transformation.
  4. Identify challenges and benefits to file transformation.



Terms to know




The act of making information available. To increase ease of access, data should be made available in a convenient and modifiable form.


Content that is accessible is designed and developed so that people with disabilities can use it. For data curators, accessibility can include technical requirements that facilitate access for people with a diverse range of hearing, movement, sight, and cognitive ability (e.g. formatting that is compatible with screen readers), as well as requirements that facilitate user interactions (e.g. understandable instructions remove barriers to access, understanding and reuse of the data). Curating with accessibility in mind can improve data for all future users.


The migration of information from one file format to another, usually for purposes of preservation or access.


Data formatted using a disciplinary standard for better integration with other datasets and/or systems.


Ensuring that data remain intact, accessible and understandable over time. This requires preserving the integrity of digital files themselves, and can be very complicated. Preservation actions may include preserving the software required to interact with the data or emulating older systems, migrating data to new formats and new media, and ensuring there is sufficient metadata to understand, interpret, manage and preserve the data.


A proprietary format is a file format of a company, organization, or individual that contains data that is ordered and stored according to a particular encoding-scheme, designed by the company or organization to be secret, such that the decoding and interpretation of this stored data is easily accomplished only with particular software or hardware that the company itself has developed. (Wikipedia)



Summary of the Transform Step

Transformation is the process of converting data from one format to another (e.g. from a proprietary format like Microsoft Excel to a non-proprietary format like CSV). Transformation may be recommended for several reasons, including increasing access, interoperability, and the likelihood of long-term preservation. Curators may convert files themselves or request that the depositor convert them. Sometimes both the original and the converted files should be deposited (e.g. a file type may be good for long-term preservation but lose functionality in conversion, so depositing both versions is recommended).


T Step Actions

  1. Decide if current formats meet access and preservation requirements.
    a. (See examples of common proprietary formats and suggested formats for preservation in Table 1: Common Transformations below).
  2. Convert files, or ask the depositor to convert them, to formats that increase access, interoperability, and the likelihood of long-term preservation.
  3. Decide if both original and converted files will be part of the upload.
  4. Update documentation to reflect transformation notes, as necessary.
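
The last action above, recording transformation notes in the documentation, can be illustrated with a minimal Python sketch. It is only one possible shape for such a note: the function name `transformation_note` and the fields it records (date, original file, converted formats, tool, whether the original was retained) are assumptions for illustration, not a format the T Step prescribes.

```python
from datetime import date


def transformation_note(original, converted, tool, retained_original, when=None):
    """Format one plain-text README line describing a file transformation.

    All field choices here are illustrative assumptions, not a required schema.
    """
    when = when or date.today()
    kept = "original retained" if retained_original else "original not retained"
    return (
        f"{when.isoformat()}: {original} converted to "
        f"{', '.join(converted)} using {tool}; {kept}."
    )


if __name__ == "__main__":
    # Example: note for a spreadsheet converted with the Excel Archival Tool.
    print(transformation_note(
        "Ostwald_etal_WaterHomeostatis2016_Data.xls",
        ["CSV", "TXT", "PNG"],
        "Excel Archival Tool",
        retained_original=True,
    ))
```

A line like this can be appended to the dataset's README so that future users can see which files are derivatives and how they were produced.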



Table 1: Common Transformations

Native Software or Format | Suggested Formats or Transformations | Transformation Tools and Notes
CZI (microscope images), Photoshop | TIFF, JPG, FITS | Use “export”; OMERO, Bio-Formats; WikiData tracks software and file formats for preservation.
Microsoft Word | PDF, TXT, HTML | Use “save as”; use the accessibility checker to maximize accessibility.
Microsoft Excel / XLS, XLSX | CSV, TSV | Use “save as”; use the Excel Archival Tool to preserve formulas.
Microsoft Access | DBF | Use “save as”; retain the original to ensure full functionality.
ChemDraw / CDX | CDXML, MOL, JPG, 001, OPJ, TRI | Retain the original. Some conversions will result in loss of information.
PDF | PDF/A | Use “save as”.
MP4, MOV, WMV | Uncompressed AVI or MOV + captions | No information is gained going from a “lower” resolution image to a “higher” one, but long-term access may be improved. Use YouTube, Vimeo, Kaltura or other tools for captioning.
Windows Media (audio, music files) | WAV, MP3 | Free audio converters are available, or use iTunes or Windows Media Player to convert files.
.SHP (geocoded xls) | CSV + extracted metadata | Retain .SHP. Use FME Tool or ArcGIS.
Webpages | WARC, TIFF | [Link to Internet Archive.] Provide screenshots.
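
The first T Step action, deciding whether current formats meet access and preservation requirements, can be partly supported by scanning a deposit for file extensions that appear in Table 1. The following is a minimal Python sketch under stated assumptions, not a definitive tool: `SUGGESTED_FORMATS` is an illustrative subset of Table 1, the extension spellings (for example `.docx` and `.xlsx`) are assumptions, and `flag_files` is a hypothetical helper name.

```python
from pathlib import Path

# Illustrative subset of Table 1: native extension -> suggested formats.
# The extension spellings are assumptions; adapt them to local policy.
SUGGESTED_FORMATS = {
    ".czi": ["TIFF", "JPG", "FITS"],
    ".doc": ["PDF", "TXT", "HTML"],
    ".docx": ["PDF", "TXT", "HTML"],
    ".xls": ["CSV", "TSV"],
    ".xlsx": ["CSV", "TSV"],
    ".cdx": ["CDXML", "MOL", "JPG"],
    ".wmv": ["Uncompressed AVI or MOV + captions"],
}


def flag_files(filenames):
    """Return {filename: suggested formats} for files whose extension is in the mapping."""
    flagged = {}
    for name in filenames:
        ext = Path(name).suffix.lower()  # compare extensions case-insensitively
        if ext in SUGGESTED_FORMATS:
            flagged[name] = SUGGESTED_FORMATS[ext]
    return flagged


if __name__ == "__main__":
    deposit = ["data.XLS", "readme.txt", "structure.cdx"]
    for name, formats in flag_files(deposit).items():
        print(f"{name}: consider {', '.join(formats)}")
```

A report like this only prompts the decision; whether to convert, and whether to keep the original alongside the converted file, remains a curatorial judgment.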


Key Ethical Considerations

  • Consider how best to navigate researcher bandwidth limitations and ownership of data with repository commitments to reducing barriers to reuse.
  • Decide how to balance the potential benefits of transformation with the risks of mistakes and loss of content/context, especially if curator or repository will be performing transformation. Document the decision.

Activity: Excel Archival Tool

The Excel Archival Tool programmatically converts Excel files to open source formats (specifically, CSV and PNG) in preparation for archiving. You will download, install, and use the Excel Archival Tool (EAT) to transform a Microsoft Excel file to archival CSV and PNG files. The Excel Archival Tool is only available for Windows; users on other platforms will need to either perform the file transformations manually or obtain access to a Windows machine to do this exercise.

The Excel Archival Tool:

  • Automates the conversion of Microsoft Excel files to CSV, and also captures:
    • Charts and figures, exported as PNGs
    • Formulas, exported as text files
    • Cell formatting and style, preserved as an HTML snapshot of the spreadsheet
  • Generates a report on the archival outputs

Activity Materials:

The Excel Archival Tool requires a Windows environment to run (Windows XP, 7). The GUI version (WithUI) requires Internet Explorer. You may want to perform the file transformations manually or borrow a Windows machine to do this exercise if you use a non-Windows machine.

  1. Download Excel Archival Tool from GitHub: http://z.umn.edu/exceltool
    1. We suggest that you use the WithUI version of the tool for this exercise.
      1. After downloading the ZIP, extract the WithUI folder.
    2. Remember where you extract these files, so you can access the tool during the exercise.
  2. Download the Excel file “Microsoft Excel data file, all figures [Microsoft Excel]” from https://hdl.handle.net/1813/43783.
  3. Create a folder for your EAT output; this is not a requirement, but will simplify the process.

Directions:

  1. Make sure Excel is not running on your computer.
  2. Launch EAT by double-clicking ExcelArchivalTool.HTA from within the WithUI program folder.
  3. Select the location of the input file (Ostwald_etal_WaterHomeostatis2016_Data.xls).
  4. Select the output location (created during setup).
  5. Select desired outputs. Suggested outputs for this exercise:
    1. Raw Spreadsheet Data (csv)
    2. Cell Formulas (txt)
    3. Charts and Figures (png)
  6. Review your outputs. What should you have? EAT should have created:
    1. One CSV file for each of the worksheets in the workbook. In this example, it should have created 10 CSV files.
    2. One Folder named “Formulas”; inside there should be one folder for each file. Each file folder will have one TXT file for each worksheet that had formulas in it. In this example, it should have created two TXT files.
    3. One “Output Report.txt” file containing a summary of the output for that EAT run.
  7. Bundle the outputs as an archival file for repository upload (zip, tar, etc., according to repository policy).
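
Steps 6 and 7 above (reviewing the outputs and bundling them) can be sketched in Python. This is a hedged illustration rather than part of EAT: the function names `check_eat_output` and `bundle` are hypothetical, and the sketch assumes the CSV files sit somewhere under the output folder while the “Formulas” folder and “Output Report.txt” sit at its top level.

```python
import zipfile
from pathlib import Path


def check_eat_output(output_dir, expected_csv_count):
    """Return a list of problems found in an EAT output folder (empty if none)."""
    out = Path(output_dir)
    problems = []
    # Count CSV files anywhere under the output folder (location is an assumption).
    csv_files = [p for p in out.rglob("*") if p.is_file() and p.suffix.lower() == ".csv"]
    if len(csv_files) != expected_csv_count:
        problems.append(f"expected {expected_csv_count} CSV files, found {len(csv_files)}")
    if not (out / "Formulas").is_dir():
        problems.append("missing Formulas folder")
    if not (out / "Output Report.txt").is_file():
        problems.append("missing Output Report.txt")
    return problems


def bundle(output_dir, zip_path):
    """Zip every file under output_dir, storing paths relative to that folder.

    Write zip_path outside output_dir so the archive does not include itself.
    """
    out = Path(output_dir)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(out.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(out).as_posix())
    return zip_path
```

For the example workbook in this exercise, the check could be called as `check_eat_output("EAT_output", 10)`, where `EAT_output` stands in for whatever output folder was created during setup; an empty list means the expected pieces are present and the folder is ready to bundle.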



Activity: Transformation Actions

Activity Materials

Dataset, either the same one you used for the EAT activity, or one of your choosing.

Directions

  1. List possible format transformations for your datasets.
  2. Put yourself in the role of different dataset stakeholders (researcher, curator, archivist, consumer, publisher).
    1. Consider how any transformation benefits different stakeholders.
    2. List some of the challenges to any particular format transformations or stakeholder perspectives.

Bonus question: At what level has your institution promised to “preserve” the data, and what are the implications of that policy in practice?




Suggested Answers

  1. List possible format transformations for your datasets.
    Answer: The Excel file should be converted to CSV, figures to JPG, and formulas and other information preserved as a TXT file.
  2. Put yourself in the role of different dataset stakeholders (researcher, curator, archivist, consumer, publisher).
    1. Consider how any transformation benefits different stakeholders.
      Answer: The researcher may benefit from increased accessibility for future re-users, resulting in increased re-use and more citations. Future consumers benefit by getting the data in a format that can easily be imported into multiple different tools. Curators and archivists benefit by having the data in a format that is more likely to be preserved in the long term.
    2. List some of the challenges to any particular format transformations or stakeholder perspectives.
      Answer: The researcher may not want to take the time to perform transformations. Future consumers may not understand how to use preservation formats, so it is important to consider also including analysis-friendly formats in the deposit. For some proprietary file formats, no other format is available.

Bonus question: At what level has your institution promised to “preserve” the data, and what are the implications of that policy in practice?
Answer: Read over your preservation support policy (or this example from Cornell’s eCommons digital repository) and consider what effect file transformations, or the lack thereof, might have on the repository’s ability to preserve the data. Are files truly preserved if they can be opened only by proprietary software that is no longer maintained or available?




Additional Resources

Excel Archival Tool from GitHub: http://z.umn.edu/exceltool - The Excel Archival Tool programmatically converts Excel files to open source formats (specifically, CSV and PNG).

McGrory, John. (2015). Poster for "Excel Archival Tool: Automating the Spreadsheet Conversion Process". Retrieved from the University of Minnesota Digital Conservancy, http://hdl.handle.net/11299/171966.

Module 3 Understand: more information about proprietary file formats, software version documentation, and other important actions for understanding the data.

Janée, Greg; Sawchuk, Sandra; Yoo, Ho Jung. (2019). Microsoft Excel Data Curation Primer. Data Curation Network GitHub Repository.

Smithsonian Institution Archives. Smithsonian Recommended Preservation Formats for Electronic Records. https://siarchives.si.edu/what-we-do/digital-curation/recommended-preservation-formats-electronic-records.

Cornell University Library. File formats for digital content: Probability for full long-term preservation, in Recommended File Formats. https://guides.library.cornell.edu/ecommons/formats