Publisher Introduction

The esg-publisher (esgcet) Python package contains a collection of command-line utilities to scan, manipulate and push dataset metadata to an ESGF index node. The publication process consists of several required steps and some optional ones. Publisher functionality is also available via several submodules/classes in the package.

The publisher software changed significantly starting with v5.*. Prior versions stored dataset metadata in the legacy ESGF data node PostgreSQL database and generated THREDDS catalogs; the actual publication to the ESGF index occurred via catalog harvesting. The more recent publisher instead simplifies the process into the following phases:

  1. Local scan of datasets (uses xarray by default)

  2. Record generation using scan, mapfile and auxiliary (JSON) information/files as input

  3. Update check for an existing dataset and manipulation of previous versions

  4. Push/publish of record(s) to ESGF index

And an optional project-specific phase:

  • PID registration and citation URL generation (CMIP6 and input4MIPs)
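The four core phases above can be pictured as a small pipeline. The following is only a minimal sketch under stated assumptions: the Record class, the function names and the in-memory list standing in for the index are all hypothetical and are not part of the esgcet API. It is meant to show how the phases hand data to one another, in particular how the update check flags an older version before the new record is pushed.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a dataset record (not the esgcet schema)."""
    dataset_id: str
    version: int
    latest: bool = True
    attributes: dict = field(default_factory=dict)


def scan(path):
    # Phase 1 stand-in: a real scan would read the data files (e.g. with xarray).
    return {"source_path": path}


def make_record(dataset_id, version, scan_info):
    # Phase 2: combine scan output with identifiers that come from the mapfile.
    return Record(dataset_id, version, attributes=dict(scan_info))


def check_update(index, record):
    # Phase 3: flag any older version of the same dataset as no longer latest.
    for existing in index:
        same_dataset = existing.dataset_id == record.dataset_id
        if same_dataset and existing.version < record.version:
            existing.latest = False


def push(index, record):
    # Phase 4 stand-in: a real push would send the record to the ESGF index.
    index.append(record)


def publish(index, dataset_id, version, path):
    # Run the four phases in order for a single dataset.
    record = make_record(dataset_id, version, scan(path))
    check_update(index, record)
    push(index, record)
    return record
```

Calling publish twice for the same hypothetical dataset identifier with increasing version numbers leaves the first record marked as not latest and the second as latest.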

For those familiar with the previous publisher, please be aware of the following differences between earlier versions and v5.*:

  • A Python 3 environment is required (most prior versions ran on Python 2)

  • The configuration (.yaml) file format replaces the old .ini format. Note that the old format for project-specific .ini files is still used by the esgf-prepare tools (e.g. esgmapfile). The v5.* publisher can migrate the needed settings from the previous .ini files.

  • Prior invocations of esgpublish required separate --thredds and --publish stages. Those arguments have been eliminated. In the general case, you can run esgpublish in a single command. Advanced users may choose to run the individual publishing steps separately to build workflows, for instance under an external workflow manager.

What esg-publisher DOES NOT do

Please note that the ESGF publisher software in this module does not upload data to ESGF. It is meant to register the metadata of datasets hosted at a site that runs an ESGF data node. See the ESGF site for more information on data nodes. Nor does the Python module manage a workflow of multiple datasets; employ other tools or techniques to manage such workflows.

Prerequisites

  • A Python environment, e.g. one created using venv, conda (check Anaconda's current licensing terms) or miniforge/mamba.

  • A mount point for the data on the same host as the publisher software installation, so that the publisher scan utility (e.g. autocurator) can access it.

  • Basic dataset information provided in the ESG mapfile format (.map). The most common approach is to generate it with the esgf-prepare/esgmapfile utility.

  • Familiarity with your project’s requirements, and data that has been formatted to the project’s specifications. For instance, if you are participating in CMIP, you should have read all the relevant information provided on the CMIP website.
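To make the mapfile prerequisite more concrete, the sketch below reads one mapfile entry into a dictionary. It assumes the pipe-delimited layout commonly produced by esgmapfile (dataset identifier and version joined by "#", then file path, size in bytes, and optional key=value fields); the function name parse_mapfile_line and the example values are hypothetical, and the publisher's own mapfile handling may differ.

```python
def parse_mapfile_line(line):
    """Split one mapfile entry into its parts (assumed layout, see lead-in)."""
    fields = [part.strip() for part in line.split("|")]
    if len(fields) < 3:
        raise ValueError("expected at least: dataset#version | path | size")

    dataset, path, size = fields[:3]
    # The dataset field carries the version after a '#' separator.
    dataset_id, _, version = dataset.partition("#")

    # Remaining fields are optional key=value pairs such as checksum=...
    extras = dict(item.split("=", 1) for item in fields[3:] if "=" in item)

    entry = {
        "dataset_id": dataset_id,
        "version": version,
        "path": path,
        "size": int(size),
    }
    entry.update(extras)
    return entry


# Illustrative entry only; the identifier, path and checksum are made up.
example = (
    "project.example.dataset#20200101 | /data/example/tas.nc | 1024 "
    "| mod_time=1577836800.0 | checksum=abc123 | checksum_type=SHA256"
)
entry = parse_mapfile_line(example)
```

Keeping the optional fields in a plain dictionary means that entries with or without checksum information can be handled by the same code.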