The act of making information available. To increase ease of access, data should be made available in a convenient and modifiable form.
Learning Outcomes
Curators will be able to:
1. Describe what transformation is, and why it is important.
2. Identify common file formats to maximize long-term access and preservation.
3. Use different tools available to help with file transformation.
4. Identify challenges and benefits to file transformation.
Content that is accessible is designed and developed so that people with disabilities can use it. For data curators, accessibility can include technical requirements that facilitate access for people with a diverse range of hearing, movement, sight, and cognitive ability (e.g. formatting that is compatible with screen readers), as well as requirements that facilitate user interactions (e.g. understandable instructions remove barriers to access, understanding and reuse of the data). Curating with accessibility in mind can improve data for all future users.
The migration of information from one file format to another, usually for purposes of preservation or access.
Data formatted using a disciplinary standard for better integration with other datasets and/or systems.
Ensuring that data remain intact, accessible and understandable over time. This requires preserving the integrity of digital files themselves, and can be very complicated. Preservation actions may include preserving the software required to interact with the data or emulating older systems, migrating data to new formats and new media, and ensuring there is sufficient metadata to understand, interpret, manage and preserve the data.
A proprietary format is a file format of a company, organization, or individual that contains data that is ordered and stored according to a particular encoding-scheme, designed by the company or organization to be secret, such that the decoding and interpretation of this stored data is easily accomplished only with particular software or hardware that the company itself has developed. (Wikipedia)
Transformation is the process of converting data from one format to another (e.g. from a proprietary format like Microsoft Excel to a non-proprietary format like CSV). Transformation may be recommended for several reasons, including increasing access, interoperability, and the likelihood of long-term preservation. Curators may convert files themselves or request that the depositor convert them. Sometimes both the original and the converted files should be deposited: a converted file type may be better for long-term preservation but lose functionality, in which case it is recommended to deposit both versions.
1. Decide if current formats meet access and preservation requirements.
a. (See examples of common proprietary formats and suggested formats for preservation in Table 1: Common Transformations below).
2. Convert or ask depositor to convert files to formats that increase access, interoperability, and likelihood for long-term preservation.
3. Decide if both original and converted files will be part of the upload.
4. Update documentation to reflect transformation notes, as necessary.
Native Software or Format | Suggested Formats or Transformations | Transformation Tools and Notes |
---|---|---|
CZI (microscope images), Photoshop | TIFF, JPG, FITS | Use “export”; OMERO, Bio-Formats; Wikidata tracks software and file formats for preservation |
Microsoft Word | PDF, TXT, HTML | Use “save as”; use accessibility checker to maximize accessibility. |
Microsoft Excel / XLS, XLSX | CSV, TSV | Use “save as”; use Excel Archival Tool to preserve formulas |
Microsoft Access | DBF | Use “save as”; retain original to ensure full functionality. |
ChemDraw / CDX | CDXML, MOL, JPG, 001, OPJ, TRI | Retain original. Some conversions will result in loss of information. |
PDF | PDF/A | Use “save as” |
MP4, MOV, WMV | Uncompressed AVI or MOV + captions | No information is gained going from a “lower” resolution image to a “higher” one, but long-term access may be improved. Use YouTube, Vimeo, Kaltura or other tools for captioning. |
Windows Media (audio, music files) | WAV, MP3 | Free audio converters are available, or use iTunes or Windows Media Player to convert files. |
.SHP (geocoded xls) | CSV + extracted metadata | Retain .SHP. Use FME Tool or ArcGIS |
Webpages | WARC, TIFF | Link to the Internet Archive. Provide screenshots. |
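
Step 1 of the workflow (deciding whether current formats meet access and preservation requirements) can be partly sketched in code. The example below is a minimal illustration, not a curation tool: the `SUGGESTED_FORMATS` lookup and the `review_formats` function are hypothetical names, and the file extensions are assumptions chosen to stand in for a few rows of the table above, which mostly lists software rather than extensions. A real lookup should follow your repository's own format policy.

```python
from pathlib import Path

# Hypothetical lookup based on a few rows of the table above.
# The extensions are assumptions; the suggested formats come from the table.
SUGGESTED_FORMATS = {
    ".czi": ["TIFF", "JPG", "FITS"],
    ".docx": ["PDF", "TXT", "HTML"],
    ".xls": ["CSV", "TSV"],
    ".xlsx": ["CSV", "TSV"],
    ".cdx": ["CDXML", "MOL", "JPG"],
    ".wmv": ["Uncompressed AVI or MOV + captions"],
    ".shp": ["CSV + extracted metadata"],
}


def review_formats(filenames):
    """Return {filename: suggested formats} for files that may need transformation.

    Files whose extension is not in the lookup are left out; that does not
    mean they are preservation-friendly, only that this sketch has no rule
    for them.
    """
    flagged = {}
    for name in filenames:
        extension = Path(name).suffix.lower()  # compare case-insensitively
        if extension in SUGGESTED_FORMATS:
            flagged[name] = SUGGESTED_FORMATS[extension]
    return flagged


if __name__ == "__main__":
    deposit = ["data.XLS", "readme.txt", "figure1.czi"]
    for name, targets in review_formats(deposit).items():
        print(f"{name}: consider converting to {', '.join(targets)}")
```

A list like this only supports the decision; whether to convert, and whether to keep the original alongside the converted file, remains a curatorial judgment.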
Activity: Excel Archival Tool
The Excel Archival Tool programmatically converts Excel files to open source formats (specifically, CSV and PNG) in preparation for archiving. You will download, install, and use the Excel Archival Tool (EAT) to transform a Microsoft Excel file to archival CSV and PNG files. The Excel Archival Tool is only available for Windows; users on other platforms will need to either perform the file transformations manually or obtain a Windows machine to do this exercise.
The Excel Archival Tool:
- Automates the conversion of Microsoft Excel files to CSVs, and also captures:
  - Charts and figures, exported as PNGs
  - Formulas, exported as text files
  - Cell formatting and style, preserved as an HTML snapshot of the spreadsheet
- Generates a report on the archival outputs
Activity Materials:
The Excel Archival Tool requires a Windows environment to run (Windows XP, 7). The GUI version (WithUI) requires Internet Explorer. You may want to perform the file transformations manually or borrow a Windows machine to do this exercise if you use a non-Windows machine.
- Download Excel Archival Tool from GitHub: http://z.umn.edu/exceltool
- We suggest that you use the WithUI version of the tool for this exercise.
- After downloading the ZIP, extract the WithUI folder.
- Remember where you extract these files, so you can access the tool during the exercise.
- Download the Excel file “Microsoft Excel data file, all figures [Microsoft Excel]” from https://hdl.handle.net/1813/43783.
- Create a folder for your EAT output; this is not a requirement, but will simplify the process.
Directions:
- Make sure Excel is not running on your computer.
- Launch EAT by double-clicking the ExcelAchivalTool.HTA from within the WithUI program folder.
- Select the location of the input file (Ostwald_etal_WaterHomeostatis2016_Data.xls).
- Select the output location (created during setup).
- Select desired outputs. Suggested outputs for this exercise:
  - Raw Spreadsheet Data (csv)
  - Cell Formulas (txt)
  - Charts and Figures (png)
- Review your outputs. What should you have? EAT should have created:
  - One CSV file for each of the worksheets in the workbook. In this example, it should have created 10 CSV files.
  - One folder named “Formulas”, containing one folder for each input file. Each file folder will have one TXT file for each worksheet that had formulas in it. In this example, it should have created two TXT files.
  - One “Output Report.txt” file containing a summary of the output for that EAT run.
- Bundle as Archival File for repository upload (zip, tar, etc., according to repository policy).
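
The last two directions (reviewing the outputs and bundling them) can also be scripted. The sketch below is an illustration only and is not part of the Excel Archival Tool: `summarize_outputs` and `bundle_outputs` are hypothetical helper names, and the code assumes only what the directions state, namely that the output folder contains CSV files, a “Formulas” folder with TXT files, and an “Output Report.txt” file. It searches recursively because the exact layout of the CSV files is not specified here.

```python
import zipfile
from pathlib import Path


def summarize_outputs(output_dir):
    """Count the kinds of files the directions say EAT should have created."""
    out = Path(output_dir)
    formulas = out / "Formulas"
    csv_files = [p for p in out.rglob("*") if p.is_file() and p.suffix.lower() == ".csv"]
    formula_files = []
    if formulas.is_dir():
        formula_files = [
            p for p in formulas.rglob("*") if p.is_file() and p.suffix.lower() == ".txt"
        ]
    return {
        "csv_files": len(csv_files),
        "formula_txt_files": len(formula_files),
        "has_report": (out / "Output Report.txt").is_file(),
    }


def bundle_outputs(output_dir, archive_path):
    """Write every file under output_dir into a ZIP archive, keeping relative paths."""
    out = Path(output_dir)
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(out.rglob("*")):
            # Skip directories, and skip the archive itself if it lives in output_dir.
            if path.is_file() and path.resolve() != archive_path.resolve():
                archive.write(path, path.relative_to(out).as_posix())
    return archive_path


if __name__ == "__main__":
    # "eat_output" is a placeholder for the output folder created during setup.
    print(summarize_outputs("eat_output"))
    print(bundle_outputs("eat_output", "eat_output_bundle.zip"))
```

For the example workbook, the summary can be compared against the counts given in the directions (10 CSV files and two formula TXT files). ZIP is used here only as an example; follow your repository's policy on archive formats.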
Activity: Transformation Actions
Activity Materials
Dataset, either the same one you used for the EAT activity, or one of your choosing.
Directions
- List possible format transformations for your datasets.
- Put yourself in the role of different dataset stakeholders (researcher, curator, archivist, consumer, publisher).
- Consider how any transformation benefits different stakeholders.
- List some of the challenges to any particular format transformations or stakeholder perspectives.
Bonus question: At what level has your institution promised to “preserve” the data, and what are the implications of that policy in practice?
Read over your preservation support policy (or this example from Cornell’s eCommons digital repository) and consider what effect file transformations, or the lack thereof, might have on the repository’s ability to preserve the data. Are files truly preserved if they can only be opened by proprietary software that is no longer maintained or available?
Excel Archival Tool from GitHub: http://z.umn.edu/exceltool - The Excel Archival Tool programmatically converts Excel files to open source formats (specifically, CSV and PNG).
McGrory, John. (2015). Poster for "Excel Archival Tool: Automating the Spreadsheet Conversion Process". Retrieved from the University of Minnesota Digital Conservancy, http://hdl.handle.net/11299/171966.
Module 3 Understand: more information about proprietary file formats, software version documentation, and other important actions for understanding the data.
Janée, Greg; Sawchuk, Sandra; Yoo, Ho Jung. (2019). Microsoft Excel Data Curation Primer. Data Curation Network GitHub Repository.
Smithsonian Institution Archives. Smithsonian Recommended Preservation Formats for Electronic Records. https://siarchives.si.edu/what-we-do/digital-curation/recommended-preservation-formats-electronic-records.
Cornell University Library. File formats for digital content: Probability for full long-term preservation, in Recommended File Formats. https://guides.library.cornell.edu/ecommons/formats