Best practices

1. Data Management Plan

📝 Make a data management plan (DMP) from the start and you will be far ahead. Here are some essentials to think about when setting up a plan.

1. Types of data

  • What types of data will you be creating or capturing: experimental data, observational data, model simulations, retrieval of existing data?
  • How will you capture, create, and/or process the data? (e.g. instruments used, software, imaging)

2. Contextual details (Metadata) needed to make data meaningful to others

  • What will be the naming convention of your files?
  • What file formats will you be using?

3. Quality control

  • What will you do to ensure that the data are not erroneous? Consider checks during data generation/collection, during data entry, and during further data processing, and list the software or rules you use to check quality.
  • Who is responsible for quality control? (e.g. do you ask a collaborator or your supervisor to check the data?)

4. Storage, backup and security

  • What will be the URL where your data will be available?
  • What is your backup plan for the data?
  • Who will own the copyright or intellectual property rights to the data?
  • How, and among whom, will the data be shared during the project and after it is finished?

5. Protection and privacy

  • If relevant: how are you addressing any ethical or privacy issues? (e.g. limiting access, encryption, anonymization of data)?

2. Naming files and directories

📂 Start your project on the right foot! Organizing your files and directories effectively can save you countless hours and make collaboration a breeze. While there is no single "best" way to structure and name files, different methods may be more suitable for specific projects, depending on the nature of the work and personal preferences.

A common challenge faced by students and researchers alike is maintaining an organized project directory. The key to overcoming this is to start with a well-defined directory structure tailored to each individual project unit, such as a manuscript or thesis. This structure should be informed by your project proposal’s roadmap and Data Management Plan (DMP). While your project units and their requirements may evolve, having a clear starting point can provide a solid foundation for managing your data effectively.

Here, we provide practical guidelines to help you establish a consistent and efficient approach to naming files and directories. These examples are designed to be simple yet adaptable to suit your project's specific needs.

Key guidelines:

  • Organize by Relevant Metadata: Sort files into directories based on essential metadata such as compound, technique, date, or the person collecting the data to enhance findability and maintain consistency across your dataset.
  • Establish Clear Naming Conventions: Define a consistent and intuitive naming convention, and make sure collaborators and users understand the rationale behind it and follow it to maintain data integrity and usability.
  • Fully Describe File Contents in Filenames: Filenames should comprehensively identify their contents so that files can be located easily through search. Avoid abbreviations that may confuse others, and keep names clear for anyone accessing the data in the future.
  • Avoid Special Characters and Spaces: Do not use special characters such as @ or #, or spaces, in filenames. Use underscores (_) to separate words so filenames remain compatible across systems and applications.
  • Use the ISO 8601 Date Format: Always use the ISO 8601 date format (YYYYMMDD) in filenames to maintain consistent sortability. This format ensures that files are ordered chronologically in directory listings and simplifies future retrieval.
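Applied in code, these rules can be wrapped in a small helper. The sketch below is hypothetical (the function name and metadata fields are illustrative, not a prescribed tool): it builds filenames from project metadata, replaces spaces with underscores, and embeds an ISO 8601 date.

```python
from datetime import date

def make_filename(project, technique, collector, when=None, ext="csv"):
    """Build a filename like project_technique_collector_YYYYMMDD.ext.

    All parts are lowercased and spaces become underscores so the name
    stays portable across operating systems and applications.
    """
    when = when or date.today()
    parts = [project, technique, collector, when.strftime("%Y%m%d")]
    clean = [str(p).strip().lower().replace(" ", "_") for p in parts]
    return "_".join(clean) + "." + ext

# Files named this way sort chronologically because of the YYYYMMDD part.
print(make_filename("water quality", "NMR", "jdoe", date(2017, 5, 3)))
# water_quality_nmr_jdoe_20170503.csv
```

Generating names programmatically also guarantees the convention is applied the same way every time, rather than relying on each collaborator to remember it.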

3. Spreadsheets

🔢 Spreadsheets are a powerful tool for organizing, analyzing, and exploring data, but they come with risks if not handled properly. Following best practices ensures your data remains consistent, interpretable, and ready for future use. Key practices include maintaining raw data integrity, using clear and standardized formats, and ensuring tables are well-structured. For long-term storage, consider saving spreadsheets in a CSV (comma-separated values) format, complemented by metadata files that describe the dataset's structure and conventions. Adopting these habits will not only safeguard your data but also enhance its usability for future analyses and collaborations.

Below are essential guidelines to keep your spreadsheet data organized and reliable:

Key guidelines:

  • Keep Raw Data Raw: Always keep a copy of the raw data unchanged. Perform calculations and manipulations in a separate file to avoid corrupting the original data.
  • Use Single Rectangular Tables: Each spreadsheet should contain a single rectangular table with a single header line, with data entered consistently across rows and columns.
  • Avoid Empty Rows or Columns: Do not leave empty rows or columns in the table, so the data structure remains intact and software can read it without errors.
  • Use Descriptive Column Labels: Column labels should be clear and concise. Avoid spaces or special characters, using only letters, numbers, and underscores (_) for readability and compatibility.
  • Keep Columns Homogeneous: Ensure that each column contains a single data type or unit. For categorical data, use a consistent set of labels throughout the column.
  • Align Column Order Across Tables: When creating multiple tables, order similar columns consistently to make data easier to compare and merge later.
  • Standardize Missing and Exceptional Values: Choose a clear method to encode missing values, detection limits, and other exceptions, and apply it consistently across the dataset.
  • Ensure Consistent Formats and Spelling: Use consistent formats and spelling throughout the dataset. For example, use the same labels for categories like gender (M, F), and avoid switching languages.
  • Separate Date and Time Values: Store year, month, day, and time components (if relevant) in separate columns for clarity and compatibility with analysis tools.
  • Avoid Visual and Pop-Up Annotations: Do not use color-coding or pop-up notes in your data. Instead, include annotations as additional columns (e.g. notes) in the table for clarity and exportability.
  • Use Decimal Degrees for Spatial Data: Record spatial information as latitude and longitude in decimal degrees (WGS84 coordinate system) to ensure compatibility with GIS and mapping tools.
  • Include Metadata: Provide metadata that explains the meaning of column labels, measurement units, and other conventions. Store this in a separate sheet or a .txt file for easy reference (e.g. the dataset water_quality_2017_05.csv would be accompanied by the metadata file water_quality_2017_05_metadata.txt).
  • Use CSV for Long-Term Storage: For archival purposes, save data as .csv files.
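Several of these checks can be automated once the data are saved as CSV. As a minimal sketch using only Python's standard library (the function name and messages are illustrative), the script below flags empty rows, ragged rows, and columns that mix numbers and text:

```python
import csv

def check_table(path):
    """Flag common spreadsheet problems in a CSV file: empty rows,
    rows with the wrong number of columns, and columns that mix
    numeric and text values (non-homogeneous columns)."""
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh))
    header, body = rows[0], rows[1:]
    problems = []
    for i, row in enumerate(body, start=2):  # 1-based, counting the header line
        if not any(cell.strip() for cell in row):
            problems.append(f"row {i}: empty row")
        elif len(row) != len(header):
            problems.append(f"row {i}: expected {len(header)} columns, got {len(row)}")
    for j, name in enumerate(header):
        kinds = set()
        for row in body:
            if j < len(row) and row[j].strip():
                try:
                    float(row[j])
                    kinds.add("number")
                except ValueError:
                    kinds.add("text")
        if len(kinds) > 1:
            problems.append(f"column '{name}': mixed numbers and text")
    return problems
```

Running such a check before archiving catches structural slips (a stray blank row, a "high" typed into a numeric column) that are easy to miss by eye.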

4. File formats

📄 The use of standard file formats with consistent naming conventions is critical for maintaining data accessibility and usability in the long term. Thoughtful consideration of file formats can help ensure your data remains identifiable and usable by others in the future.

When selecting tools and formats for storing your data, pay close attention to the following key principles:

  • Preservation and Accessibility: Whenever possible, opt for open-standard formats that are widely recognized and easily reusable. For instance, saving documentation as plain text (.txt) is preferable to a harder-to-reuse format such as PDF for preservation purposes.
  • Software and Compatibility: Include information about any specific software or versions required to view your data, such as SPSS v.3 or Microsoft Excel 97-2003.
  • Version Control and Conversion Considerations: Clearly document version control practices and specify if data will transition between formats during its lifecycle. Highlight any features that could be lost during format conversion, such as system-specific labels.

Below, you’ll find detailed recommendations for preferred and acceptable formats across various data types. These guidelines draw on the UK Data Archive's best practices for managing and sharing data.

Data types with their preferred and other acceptable formats:

  • Documentation and Scripts
    Preferred: plain text (.txt); Markdown (.md); Open Document Text (.odt); Rich Text Format (.rtf); HTML (.htm, .html)
    Other acceptable: widely used proprietary formats, e.g. MS Word (.doc/.docx) or MS Excel (.xls/.xlsx); XML marked-up text (.xml) according to an appropriate DTD/schema (e.g. XHTML 1.0); PDF/A or PDF (.pdf)
  • Spectroscopic Data
    Preferred: JCAMP format (NMR, IR, Raman, UV, mass spectrometry)
  • Geospatial Data
    Preferred: GeoPackage; georeferenced TIFF (.tif, .tfw); GeoJSON; ESRI Shapefile (.shp, .shx, .dbf, .prj, .sbx, .sbn)
    Other acceptable: CAD data (.dwg); KML (.kml); ESRI Geodatabase format (.mdb); MapInfo Interchange Format (.mif)
  • Digital Image Data
    Preferred: TIFF version 6 uncompressed (.tif)
    Other acceptable: JPEG (.jpeg, .jpg); other TIFF versions (.tif, .tiff); JPEG 2000 (.jp2); PDF/A or PDF (.pdf)
  • Digital Video Data
    Preferred: MPEG-4 High Profile (.mp4)
    Other acceptable: JPEG 2000 (.mj2)
  • Digital Audio Data
    Preferred: FLAC (.flac); WAV (.wav); MP3 (spoken word only)
    Other acceptable: MP3 (general use); AIFF (.aif)
  • Qualitative (Textual) Data
    Preferred: XML text with appropriate DTD/schema (.xml); Rich Text Format (.rtf); plain text, UTF-8 (Unicode) (.txt)
    Other acceptable: plain text, ASCII (.txt); HTML (.html); MS Word (.doc/.docx); LaTeX (.tex)
  • Quantitative Data with Metadata
    Preferred: SPSS portable format (.por); delimited text with setup file (SPSS, Stata, SAS); structured text or marked-up metadata file (e.g. DDI XML)
    Other acceptable: MS Access (.mdb/.accdb)
  • Quantitative Tabular Data (Minimal)
    Preferred: comma-separated values (.csv); tab-delimited file (.tab)
    Other acceptable: delimited text using unique delimiters (.txt); proprietary formats such as MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf), OpenDocument Spreadsheet (.ods)
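Analysis scripts can produce the preferred formats directly. As a minimal sketch (the file names and contents are hypothetical), the snippet below writes results as an open .csv file with a plain-text documentation file alongside it, instead of a binary spreadsheet:

```python
import csv

# Hypothetical results; in practice these would come from your analysis.
results = [
    {"sample": "A1", "ph": 7.1, "temp_c": 18.4},
    {"sample": "A2", "ph": 6.8, "temp_c": 19.0},
]

# Preferred: an open, plain-text data format (.csv) instead of .xlsx.
with open("results.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=["sample", "ph", "temp_c"])
    writer.writeheader()
    writer.writerows(results)

# Preferred: documentation as plain text (.txt) next to the data.
with open("results_readme.txt", "w", encoding="utf-8") as fh:
    fh.write("results.csv: pH and temperature (degrees C) per sample.\n")
    fh.write("Comma-delimited, UTF-8; created with the Python csv module.\n")
```

Exporting to open formats from the start means there is nothing to convert (and nothing to lose in conversion) when the data are archived later.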

5. Writing code

🔧 In research, writing clear and well-structured scripts is critical for ensuring reproducibility, transparency, and collaboration in scientific studies. Following coding best practices can save time, reduce errors, and make your analyses more accessible to others.

It is highly recommended to use code for all processing steps, from data preparation, filtering, and extraction to the final analysis. By writing scripts for these tasks, you ensure that the raw data remain untouched while creating a reproducible workflow. This approach also avoids relying on proprietary software, which may be convenient but often uses file formats and processing steps that are not easily replicated. By embracing coding, you align your workflows with open science principles, ensuring your data and results are accessible, transparent, and reusable, which ultimately helps others validate your research.

Learning to code may feel like a steep learning curve at first, but the effort is highly rewarding. Moreover, IBED's Computational Support Team provides coding support and resources to help you get started and overcome challenges. As you develop your skills, you'll gain the ability to automate tasks, handle large datasets more efficiently, and collaborate with others more effectively.

For those new to coding, simple steps such as organizing code by project, adding comments, and using clear names for variables and functions can significantly improve your scripts. As you become more comfortable, adopting tools like version control systems or writing modular code can further enhance your workflow and facilitate collaboration in larger projects. These best practices are designed to help you produce reliable, reusable, and efficient code for your research. For the real coding enthusiasts, please find here a step-by-step tutorial for making R packages and using git.

Below is an overview of these best practices, with links to resources to help you get started.

Key guidelines:

  • Use Projects to Organize Your Work: Organizing your code, data, and related materials into well-defined projects ensures a coherent structure that is easy to navigate and maintain. Each project should have a clear directory structure and naming conventions, a defined scope and objectives, and work broken down into smaller, manageable tasks. Projects should be self-contained, with clear dependencies and documentation on how to run and use them, which makes it easy to collaborate with others, track progress, and isolate changes. (Example: .Rproj in RStudio for R-based projects)
  • Write Readable Code: Choose descriptive names for variables, functions, and files. Add comments to explain complex logic and improve code readability for others and your future self.
  • Follow a Style Guide: Adhere to a style guide to maintain consistency in your code. For R, refer to Hadley Wickham's style guide; for Python, follow PEP 8.
  • Adopt Modular Design: Structure code into functions, classes, or modules for better organization, reusability, and maintenance.
  • Process Raw Data with Code: Write code to process raw data rather than manipulating it directly in software. This keeps the raw data untouched and ensures reproducibility.
  • Use Version Control: Use a version control system such as Git to track code changes. Install GitHub Desktop for a beginner-friendly interface. Commit changes often, write meaningful commit messages, and use branches for feature development. See here for setting up Git for users of R.
  • Collaborate Using Pull Requests: When collaborating with others, use pull requests to review changes before merging them into the main branch, ensuring code quality and consistency.
  • Tag Releases: Use tags (e.g. v1.0.0) in Git to identify and organize stable releases, making them easier to reference in future development.
  • Test Your Code: Write unit tests or integration tests to catch errors early and confirm that changes do not break existing functionality.
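Modular design and testing reinforce each other: a small, single-purpose function is easy to test in isolation. The sketch below is illustrative (the function name and the detection-limit substitution rule are hypothetical examples, not a prescribed method):

```python
def detection_limit_mean(values, limit, substitute_factor=0.5):
    """Mean of measurements where values below `limit` (a detection
    limit) are substituted with `substitute_factor * limit`.

    Keeping this logic in one small function makes it reusable across
    scripts and easy to test on its own.
    """
    if not values:
        raise ValueError("values must not be empty")
    adjusted = [v if v >= limit else substitute_factor * limit for v in values]
    return sum(adjusted) / len(adjusted)

# A simple unit test: run with pytest, or just execute the assertion.
def test_detection_limit_mean():
    # 0.1 is below the limit of 0.2, so it is replaced by 0.5 * 0.2 = 0.1.
    assert detection_limit_mean([0.1, 0.4, 0.6], limit=0.2) == (0.1 + 0.4 + 0.6) / 3
```

Once a rule like this lives in a tested function, changing it later (say, a different substitution factor) immediately shows whether existing analyses still behave as expected.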