All Learning Resources
Why Cite Data?
This video explains what data citation is and why it's important. It also discusses what digital object identifiers (DOIs) are and how they are used.
MANTRA Research Data Management Training
MANTRA is a free, online non-assessed course with guidelines to help you understand and reflect on how to manage the digital data you collect throughout your research. It has been crafted for the use of post-graduate students, early career researchers, and also information professionals. It is freely available on the web for anyone to explore on their own.
Through a series of interactive online units you will learn about terminology, key concepts, and best practice in research data management.
There are eight online units in this course and one set of offline (downloadable) data handling tutorials that will help you:
Understand the nature of research data in a variety of disciplinary settings
Create a data management plan and apply it from the start to the finish of your research project
Name, organise, and version your data files effectively
Gain familiarity with different kinds of data formats and know how and when to transform your data
Document your data well for yourself and others, learn about metadata standards and cite data properly
Know how to store and transport your data safely and securely (backup and encryption)
Understand legal and ethical requirements for managing data about human subjects; manage intellectual property rights
Understand the benefits of sharing, preserving and licensing data for re-use
Improve your data handling skills in one of four software environments: R, SPSS, NVivo, or ArcGIS
OntoSoft Tutorial: A distributed semantic registry for scientific software
An overview of the EDI data repository and data portal
The Environmental Data Initiative (EDI) data repository is a metadata-driven archive for environmental and ecological research data described by the Ecological Metadata Language (EML). This webinar will provide an overview of the PASTA software used by the repository and demonstrate the essentials of uploading a data package to the repository through the EDI Data Portal.
FAIR Self-Assessment Tool
The FAIR Data Principles are a set of guiding principles for making data findable, accessible, interoperable and reusable (Wilkinson et al., 2016). Using this tool you will be able to assess the 'FAIRness' of a dataset and determine how to enhance its FAIRness (where applicable).
This self-assessment tool has been designed predominantly for data librarians and IT staff but could be used by software engineers developing FAIR Data tools and services, and researchers provided they have assistance from research support staff.
You will be asked questions related to the principles underpinning Findable, Accessible, Interoperable and Reusable. Once you have answered all the questions in each section you will be given a ‘green bar’ indicator based on your answers in that section, and when all sections are completed, an overall 'FAIRness' indicator is provided.
Webinar: Jupyter as a Gateway for Scientific Collaboration and Education
Project Jupyter, evolved from the IPython environment, provides a platform for interactive computing that is widely used today in research, education, journalism, and industry. The core premise of the Jupyter architecture is to design tools around the experience of interactive computing, building an environment, protocol, file format and libraries optimized for the computational process when there is a human in the loop, in a live iteration with ideas and data assisted by the computer.
The Jupyter Notebook, a system that allows users to compose rich documents that combine narrative text and mathematics together with live code and the output of computations in any format compatible with a web browser (plots, animations, audio, video, etc.), provides a foundation for scientific collaboration. The next generation of the Jupyter web interface, JupyterLab, will combine in a single user interface not only the notebook but multiple other tools to access Jupyter services and remote computational resources and data. A flexible and responsive UI allows the user to mix Notebooks, terminals, text editors, graphical consoles and more, presenting in a single, unified environment the tools needed to work with a remote environment. Furthermore, the entire design is extensible and based on plugins that interoperate via open APIs, making it possible to design new plugins tailored to specific types of data or user needs.
JupyterHub enables Jupyter Notebook and JupyterLab to be used by groups of users for research collaboration and education. We believe JupyterHub provides a foundation on which to build modern scientific gateways that support a wide range of user scenarios, from interactive data exploration in high-level languages like Python, Julia or R, to the education of researchers and students whose work relies on traditional HPC resources.
The presenter discusses the benefits and applications of Jupyter Notebooks.
Scroll to the bottom of the page to view the webinar. Presentation slides are also available on the same page.
Seismic Data Quality Assurance Using IRIS MUSTANG Metrics
Seismic data quality assurance involves reviewing data in order to identify and resolve problems that limit the use of the data – a time-consuming task for large data volumes! Additionally, no two analysts review seismic data in quite the same way. Recognizing this, IRIS developed the MUSTANG automated seismic data quality metrics system to provide data quality measurements for all data archived at IRIS Data Services. Knowing how to leverage MUSTANG metrics can help users quickly discriminate between usable and problematic data and it is flexible enough for each user to adapt it to their own working style.
This tutorial presents strategies for using MUSTANG metrics to optimize your own data quality review. Many of the examples in this tutorial illustrate approaches used by the IRIS Data Services Quality Assurance (QA) staff.
R Class for Seismologists
The IRIS Data Management Center (DMC) archives and distributes data to support the seismological research community. The class described here introduces DMC and other seismologists to the R statistical programming language and its use with seismological data available from DMC web services. The capabilities of the seismicRoll, IRISSeismic and IRISMustangMetrics packages developed as part of the MUSTANG project will be demonstrated.
Class materials are broken up into nine separate lessons that assume some experience coding but not necessarily any familiarity with R. Lessons are presented in sequential order and assume the student already has R and RStudio installed on their computer. Autodidacts new to R should take about 20-30 hrs to complete the course. The target audience for these materials consists of IRIS DMC employees or graduate students with a degree in the natural sciences and some experience using scientific software such as MATLAB or Python.
Research Lifecycle at University of Central Florida
Short video discussing the "Research Lifecycle at University of Central Florida," a useful diagram for understanding the typical flow of a research project.
Research Data Management Community Training
Good research data management is of great importance for high-quality research. Implementing professional research data management from the start helps to avoid problems in the data creation and curation phases.
- Definition(s) of RDM
- Benefits and Advantages of RDM
- Research Data Life-Cycle
- Structure and components of RDM
- Recommended literature
Access Policies and Usage Regulations: Licenses
The webinar about licensing and policy will look into why it is important that research data are provided with licenses.
- Benefits of sharing research data
- Types of licenses
- Data ownership and reuse
- Using Creative Commons in archiving research data
During the workshop, participants will acquire a basic knowledge of data licensing.
Introduction to R
Learn the basics of reproducible workflows in R using a USGS National Water Information System (NWIS) dataset.
Instructions for accessing the dataset are provided within the tutorial.
Clean your taxonomy data with the taxonomyCleanr R package
Taxonomic data can be messy and challenging to work with. Incorrect spelling, the use of common names, unaccepted names, and synonyms, contribute to ambiguity in what a taxon actually is. The taxonomyCleanr R package helps you resolve taxonomic data to a taxonomic authority, get accepted names and taxonomic serial numbers, as well as create metadata for your taxa in the Ecological Metadata Language (EML) format.
Postgres, EML and R in a data management workflow
Metadata storage and creation of Ecological Metadata Language (EML) can be a challenge for people and organizations who want to archive their data. A workflow was developed to combine efficient EML record generation (using the package developed by the R community) with centrally-controlled metadata in a relational database. The webinar has two components: 1) a demonstration of metadata storage and management using a relational database, and 2) discussion of an example EML file generation workflow using pre-defined R functions.
Experimental Design Assistant
The Experimental Design Assistant (EDA) (https://eda.nc3rs.org.uk) is a free web-based tool that was developed by the NC3Rs (https://www.nc3rs.org.uk). It guides researchers through the design and analysis of in vivo experiments. The EDA allows users to build a stepwise visual representation of their experiment, providing feedback and dedicated support for randomization, blinding and sample size calculation. This demonstration will provide an introduction to the tool and provide guidance on getting started. Ultimately, the use of a tool such as the EDA will lead to carefully designed experiments that yield robust and reproducible data using the minimum number of animals consistent with scientific objectives.
Open Science and Innovation
This course helps you to understand open business models and responsible research and innovation (RRI) and illustrates how these can foster innovation. By the end of the course, you will:
- Understand key concepts and values of open business models and responsible research and innovation
- Know how to plan your innovation activities
- Be able to use Creative Commons licenses in business
- Understand new technology transfer policies with the ethos of Open Science
- Learn how to get things to market faster
Data Management using NEON Small Mammal Data
Undergraduate STEM students are graduating into professions that require them to manage and work with data at many points of a data management lifecycle. Within ecology, students are presented not only with many opportunities to collect data themselves but increasingly to access and use public data collected by others. This activity introduces the basic concept of data management from the field through to data analysis. The accompanying presentation materials mention the importance of considering long-term data storage and data analysis using public data.
Coffee and Code: Write Once Use Everywhere (Pandoc)
Pandoc (http://pandoc.org) is a document processing program that runs on multiple operating systems (Mac, Windows, Linux) and can read and write a wide variety of file formats. In many respects, Pandoc can be thought of as a universal translator for documents. This workshop focuses on a subset of input and output document types, just scratching the surface of the transformations made possible by Pandoc.
Click 00-Overview.ipynb on the provided GitHub page or go directly to the overview, here:
Coffee and Code: Introduction to Version Control
This is a tutorial about version control, also known as revision control, a method for tracking changes to files and folders within a source code tree, project, or any complex set of files or documents.
Also see Advanced Version Control, here: https://github.com/unmrds/cc-version-control/blob/master/03-advanced-ver...
Licensing your research outputs is an important part of practicing Open Science. After completing this course, you will:
- Know what licenses are, how they work, and how to apply them
- Understand how different types of licenses can affect research output reuse
- Know how to select the appropriate license for your research
Singularity User Guide
Singularity is a container solution created by necessity for scientific and application-driven workloads.
Over the past decade and a half, virtualization has gone from an engineering toy to a global infrastructure necessity and the evolution of enabling technologies has flourished. Most recently, we have seen the introduction of the latest spin on virtualization… “containers”.
Many scientists, especially those involved with the high performance computation (HPC) community, could benefit greatly by using container technology, but they need a feature set that differs somewhat from that available with current container technology. This necessity drove the creation of Singularity and defined its four primary functions:
- Mobility of compute
- Reproducibility
- User freedom
- Support on existing traditional HPC
This user guide introduces Singularity, a free, cross-platform and open-source computer program that performs operating-system-level virtualization also known as containerization.
Make EML with R and share on GitHub
Introduction to the Ecological Metadata Language (EML). Topics include:
- Use R to build EML for a mock dataset
- Validate EML and write to file
- Install Git and configure to track file versioning in RStudio
- Set up GitHub account and repository
- Push local content to GitHub for sharing and collaboration
Access the rendered version of this tutorial here: https://cdn.rawgit.com/EDIorg/tutorials/2002b911/make_eml_with_r/make_em...
Florilege, a new database of habitats and phenotypes of food microbe flora
This tutorial explains how to use the “Habitat-Phenotype Relation Extractor for Microbes” application available from the OpenMinTeD platform. It also explains the scientific issues it addresses, and how the results of the TDM process can be queried and exploited by researchers through the Florilège application.
In recent years, developments in molecular technologies have led to an exponential growth of experimental data and publications, many of which are open but accessible only separately. It is therefore now crucial for researchers to have bioinformatics infrastructures at their disposal that offer unified access to both data and related scientific articles. With the right text mining infrastructures and tools, application developers and data managers can rapidly access and process textual data, link them with other data and make the results available for scientists.
The text-mining process behind Florilege has been set up by INRA using the OpenMinTeD environment. It consists of extracting the relevant information, mostly textual, from scientific literature and databases. Words or word groups are identified and assigned a type, like “habitat” or “taxon”.
Sections of the tutorial:
1. Biological motivation of the Florilege database
2. Florilège Use-Case on OpenMinTeD (includes a description of how to access the Habitat-Phenotype Relation Extractor for Microbes application)
3. Florilege backstage: how is it built?
4. Florilège description
5. How to use Florilege?
Best Practice in Open Research
This course introduces some practical steps toward making your research more open. We begin by exploring the practical implications of open research, and the benefits it can deliver for research integrity and public trust, as well as benefits you will accrue in your own work. After a short elaboration of some useful rules of thumb, we move quickly onto some more practical steps towards meeting contemporary best practice in open research and introduce some useful discipline-specific resources. Upon completing this course, you will:
- Understand the practical implications of taking a more open approach to research
- Be prepared to meet expectations relating to openness from funders, publishers, and peers
- Be able to reap the benefits of working openly
- Have an understanding of the guiding principles to follow when building openness into your research workflow
- Know about some useful tools and resources to help you embed Open Science into your research practices
Managing and Sharing Research Data
Data-driven research is becoming increasingly common in a wide range of academic disciplines, from Archaeology to Zoology, and spanning Arts and Science subject areas alike. To support good research, we need to ensure that researchers have access to good data. Upon completing this course, you will:
- Understand which data you can make open and which need to be protected
- Know how to go about writing a data management plan
- Understand the FAIR principles
- Be able to select which data to keep and find an appropriate repository for them
- Learn tips on how to get maximum impact from your research data
GeoNode for Developers Workshop
GeoNode is a web-based application and platform for developing geospatial information systems (GIS) and for deploying spatial data infrastructures (SDI). It is designed to be extended and modified and can be integrated into existing platforms.
This workshop covers the following topics:
- GeoNode in development mode, how to
- The geonode-project to customize GeoNode
- Change the look and feel of the application
- Add your own app
- Add your own models, view, and logic
- Build your own APIs
- Add a third party app
- Deploy your customized GeoNode
To access geonode-project on GitHub, go to https://github.com/GeoNode/geonode-project .
Coffee and Code: Advanced Version Control
Learn advanced version control practices for tracking changes to files and folders within a source code tree, project, or any complex set of files or documents.
This tutorial builds on concepts taught in "Introduction to Version Control," found here: https://github.com/unmrds/cc-version-control/blob/master/01-version-cont....
Git Repository for this Workshop: https://github.com/unmrds/cc-version-control
Science Impact of Sustained Cyberinfrastructure: The Pegasus Example
This talk is the first in a series of NSF's Office of Advanced Cyberinfrastructure (OAC) webinars. Dr. Deelman describes the challenges of developing and sustaining cyberinfrastructure capabilities that have impact on scientific discovery and that innovate in the changing cyberinfrastructure landscape. The recent multi-messenger observation triggered by LIGO and VIRGO’s first detection of gravitational waves produced by colliding neutron stars is a clear display of the increasing impact of dependable research cyberinfrastructure (CI) on scientific discovery.
Today’s cyberinfrastructure (hardware, software, and workforce) underpins the entire scientific workflow, from data collection at instruments, through complex analysis, to simulation, visualization, and analytics. The Pegasus project is an example of a cyberinfrastructure effort that enables LIGO and other communities to accomplish their scientific goals. In addition, it delivers robust automation capabilities to researchers at the Southern California Earthquake Center (SCEC) studying seismic phenomena, to astronomers seeking to understand the structure of the universe, to material scientists developing new drug delivery methods, and to students seeking to understand human population migration.
Environmental Data Initiative Five Phases of Data Publishing Webinar - What are metadata and structured metadata?
Metadata are essential to understanding a dataset. The talk covers:
- How structured metadata are used to document, discover, and analyze ecological datasets.
- Tips on creating quality metadata content.
- An introduction to the metadata language used by the Environmental Data Initiative, Ecological Metadata Language (EML). EML is written in XML, a general purpose mechanism for describing hierarchical information, so some general XML features and how these apply to EML are covered.
This video in the Environmental Data Initiative (EDI) "Five Phases of Data Publishing" tutorial series covers the third phase of data publishing, describing.
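Because EML is XML, standard XML tooling can read its hierarchy. The following is a minimal sketch in Python, assuming a heavily simplified, EML-like fragment: the dataset title, creator name, and column names are hypothetical, and real EML documents carry namespaces and many more required elements than shown here.

```python
import xml.etree.ElementTree as ET

# Simplified, EML-like fragment for illustration only (an assumption):
# it is not schema-valid EML and omits namespaces and required elements.
EML_LIKE = """
<eml>
  <dataset>
    <title>Hypothetical stream temperature record</title>
    <creator>
      <individualName><surName>Doe</surName></individualName>
    </creator>
    <dataTable>
      <attributeList>
        <attribute><attributeName>site_id</attributeName></attribute>
        <attribute><attributeName>temp_c</attributeName></attribute>
      </attributeList>
    </dataTable>
  </dataset>
</eml>
"""


def summarize(xml_text):
    """Pull a few descriptive fields out of the nested XML structure."""
    root = ET.fromstring(xml_text.strip())
    dataset = root.find("dataset")
    return {
        "title": dataset.findtext("title"),
        # iter() walks the whole subtree, so nesting depth does not matter.
        "creators": [e.text for e in dataset.iter("surName")],
        "attributes": [e.text for e in dataset.iter("attributeName")],
    }


summary = summarize(EML_LIKE)
print(summary)
```

The same traversal idea applies to real EML files, although those are usually better handled with a dedicated EML library and schema validation.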
Environmental Data Initiative Five Phases of Data Publishing Webinar - Make metadata with the EML assembly line
High-quality structured metadata is essential to the persistence and reuse of ecological data; however, creating such metadata requires substantial technical expertise and effort. To accelerate the production of metadata in the Ecological Metadata Language (EML), we’ve created the EMLassemblyline R code package. Assembly line operators supply the data and information about the data, then the machinery auto-extracts additional content and translates it all to EML. In this webinar, the presenter will provide an overview of the assembly line, how to operate it, and a brief demonstration of its use on an example dataset.
This video in the Environmental Data Initiative (EDI) "Five Phases of Data Publishing" tutorial series covers the third phase of data publishing, describing.
Environmental Data Initiative Five Phases of Data Publishing Webinar - Creating "clean" data for archiving
Not all data are easy to use, and some are nearly impossible to use effectively. This presentation lays out the principles and some best practices for creating data that will be easy to document and use. It will identify many of the pitfalls in data preparation and formatting that will cause problems further down the line and how to avoid them.
This video in the Environmental Data Initiative (EDI) "Five Phases of Data Publishing" tutorial series covers the second phase of data publishing, cleaning data. For more guidance from EDI on data cleaning, also see "How to clean and format data using Excel, OpenRefine, and Excel," located here: https://www.youtube.com/watch?v=tRk01ytRXjE.
Environmental Data Initiative Five Phases of Data Publishing Webinar - How to clean and format data using Excel, OpenRefine, and Excel
This webinar provides an overview of some of the tools available for formatting and cleaning data, guidance on tool suitability and limitations, and an example dataset and instructions for working with those tools.
This video in the Environmental Data Initiative (EDI) "Five Phases of Data Publishing" tutorial series covers the second phase of data publishing, cleaning data.
For more guidance from EDI on data cleaning, also see "Creating 'clean' data for archiving," located here: https://www.youtube.com/watch?v=gW_-XTwJ1OA.
A FAIR afternoon: on FAIR data stewardship for Technology Hotel (/ETH4) beneficiaries
FAIR data awareness event for the 4th edition of the Enabling Technologies Hotels programme. One of the aims of the Enabling Technologies Hotels programme is to promote the application of the FAIR data principles in research data stewardship, data integration, methods, and standards. This relates to the objective of the National Plan Open Science that research data be made suitable for reuse.
With this FAIR data training, ZonMw and DTL aim to help researchers (hotel guests and managers) who have obtained a grant in the 4th round of the programme to apply FAIR data management in their research.
Intro to SQL for Data Science
The role of a data scientist is to turn raw data into actionable insights. Much of the world's raw data—from electronic medical records to customer transaction histories—lives in organized collections of tables called relational databases. Therefore, to be an effective data scientist, you must know how to wrangle and extract data from these databases using a language called SQL (pronounced ess-que-ell, or sequel). This course teaches you everything you need to know to begin working with databases today!
USGS Data Templates Overview
Creating Data Templates for data collection, data storage, and metadata saves time and increases consistency. Utilizing form validation increases data entry reliability.
- Why use data templates?
- Templates During Data Entry - how to design data validating templates
- After Data Entry - ensuring accurate data entry
- Data Storage and Metadata
- Best Practices
- Data Templates
- Long-term Storage
- Tools for creating data templates
  - Google Forms
  - Microsoft Excel
  - Microsoft Access
  - OpenOffice - Calc
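The idea of a validating data-entry template can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, value ranges, and the `validate` helper are hypothetical and are not taken from the USGS materials, which use form and spreadsheet tools rather than code.

```python
from datetime import date

# Hypothetical template: field name -> check returning True when the value is valid.
TEMPLATE = {
    "site_id": lambda v: isinstance(v, str) and v.strip() != "",
    "sample_date": lambda v: isinstance(v, date) and v <= date.today(),
    "water_temp_c": lambda v: isinstance(v, (int, float)) and -5 <= v <= 50,
}


def validate(record):
    """Return a list of problems found in one data-entry record."""
    problems = []
    for field, is_valid in TEMPLATE.items():
        if field not in record:
            problems.append(f"{field}: missing")
        elif not is_valid(record[field]):
            problems.append(f"{field}: invalid value {record[field]!r}")
    return problems


good = {"site_id": "A1", "sample_date": date(2020, 5, 1), "water_temp_c": 12.5}
bad = {"site_id": "", "water_temp_c": 120}

print(validate(good))  # no problems expected
print(validate(bad))   # empty site, missing date, out-of-range temperature
```

Checking each record at entry time, rather than after the fact, is what makes a template improve data entry reliability.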
Mozilla Science Lab Open Data Instructor Guides
This site is a resource for train-the-trainer type materials on Open Data. It's meant to provide a series of approachable, fun, collaborative workshops where each of the modules is interactive and customizable to meet a variety of audiences.
ISRIC Spring School
The ISRIC Spring School aims to introduce participants to world soils, soil databases, software for soil data analysis and visualisation, digital soil mapping and soil-web services through two 5-day courses run in parallel. Target audiences for the Spring School include soil and environmental scientists involved in (digital) soil mapping and soil information production at regional, national and continental scales; soil experts and professionals in natural resources management and planning; and soil science students at MSc and PhD level. Example courses include "World Soils and their Assessment" (WSA) and "Hands-on Global Soil Information Facilities" (GSIF). Data management topics are included within the course topics.
U.S. Fish and Wildlife Service National Conservation Training Center
The National Conservation Training Center (NCTC) of the U.S. Fish and Wildlife Service (USFWS) provides a searchable catalog of courses, offered at the NCTC physical location and online, that relate to data skills and data management. Courses include instructor-led classes, online self-study, online instructor-led courses, and webinars. Some courses are free; others charge a fee. Many of the courses use GIS data sources and systems, including USFWS datasets that can be found at: https://www.fws.gov/gis/data/national/index.html. The search interface is available on the NCTC home page.
Hands-on Intro to SQL (Structured Query Language)
This workshop teaches the basics of working with and querying structured data in a database environment, using the SQLite plugin for Firefox. The data are a time series for a small mammal community in southern Arizona, in the southwestern United States. They come from a project, running for almost 40 years, that studies the effects of rodents and ants on the plant community. The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots.
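To give a flavor of the kind of query such a workshop covers, here is a minimal sketch using Python's built-in `sqlite3` module in place of the Firefox plugin. The table names, column names, and rows are hypothetical stand-ins for plot and capture records, not the workshop's actual dataset.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE plots (plot_id INTEGER PRIMARY KEY, treatment TEXT);
CREATE TABLE captures (
    record_id INTEGER PRIMARY KEY,
    year INTEGER,
    plot_id INTEGER REFERENCES plots(plot_id),
    species_id TEXT
);
INSERT INTO plots VALUES (1, 'control'), (2, 'rodent exclosure');
INSERT INTO captures VALUES
    (1, 1990, 1, 'DM'),
    (2, 1990, 1, 'PP'),
    (3, 1991, 1, 'DM'),
    (4, 1991, 2, 'PP');
""")

# Join captures to their plot, then count captures per treatment.
rows = conn.execute("""
    SELECT p.treatment, COUNT(*) AS n_captures
    FROM captures AS c
    JOIN plots AS p ON p.plot_id = c.plot_id
    GROUP BY p.treatment
    ORDER BY p.treatment
""").fetchall()

print(rows)
conn.close()
```

Keeping plot treatments in their own table and joining on `plot_id` avoids repeating the treatment label on every capture row.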