The esgcet package for ESGF Publication¶
Installation¶
You can install esgcet in one of three ways: pip, conda, or git.
Conda & Required Packages¶
We recommend creating a conda env before installing esgcet:
conda create -n esgf-pub -c conda-forge -c esgf-forge pip cmor autocurator esgconfigparser
conda activate esgf-pub
Note
The command above creates a new environment for the publisher. If you wish to upgrade from a previous version of the publisher, this is recommended over reusing an existing environment. If you install esgcet using conda, the cmor package (which is distinct from the CMOR tables) should be installed automatically along with esgcet. A cmor package already present in your env may cause conflicts, though not always.
esgcet install:¶
To install esgcet by pip run the following (note the version tag is required):
pip install esgcet==5.0.0a12
Note
You must specify the version because v5.0.x is still in pre-release. Installing esgcet
without a version tag will install the previous major version (v3.xx).
To install esgcet into an existing environment using conda run the following:
conda install -c conda-forge -c esgf-forge esgcet
You can also add esgcet
to the list of packages in the command above when creating your new conda environment.
To install esgcet by cloning our GitHub repository (useful if you want to modify the software), first ensure you have a suitable Python in your environment (see above for information on conda, etc.), and then run:
git clone http://github.com/ESGF/esg-publisher.git -b gen-five-pkg
cd esg-publisher
cd pkg
python3 setup.py install
Now you will be able to call all commands in this package from any directory. A default config file, esg.ini
will be created in $HOME/.esg
where $HOME
is your home directory.
NOTE: if you intend to publish CMIP6 data, the publisher will run the PrePARE module to check all file metadata. For this check to work, you must download the CMOR tables before the publisher will run successfully. See the CMOR section for more info.
Config¶
The default config file will look like this:
[DEFAULT]
note = IMPORTANT: please configure below in the [user] section, that is what the publisher will use to read configured settings. The below are marked as necessary or optional variables.
version = 5.0.0a2
data_node = * necessary
index_node = * necessary
cmor_path = * necessary, and must be an absolute path (not relative)
autoc_path = autocurator * optional, default is autocurator conda binary, can be replaced with a file path, relative or absolute
data_roots = * necessary, must be in json loadable dictionary format
cert = ./cert.pem * optional, default assumes cert in current directory, override to change
test = false * optional, default assumes test is off, override to change
project = none * optional, default will be parsed from mapfile name
set_replica = false * optional, default assumes replica publication off
globus_uuid = none * optional
data_transfer_node = none * optional
pid_creds = * necessary
silent = false * optional
verbose = false * optional
[example]
data_node = esgf-data1.llnl.gov
index_node = esgf-node.llnl.gov
cmor_path = /export/user/cmor/Tables
autoc_path = autocurator
data_roots = {"/esg/data": "esgf_data"}
cert = ./cert.pem
test = false
project = CMIP6
set_replica = true
globus_uuid = none
data_transfer_node = none
pid_creds = [{"url": "aims4.llnl.gov", "port": 7070, "vhost": "esgf-pid", "user": "esgf-publisher", "password": "<password>", "ssl_enabled": true, "priority": 1}]
silent = false
verbose = false
[user]
data_node =
index_node =
cmor_path =
autoc_path = autocurator
data_roots =
cert = ./cert.pem
test = false
project = none
set_replica = false
globus_uuid = none
data_transfer_node = none
pid_creds =
silent = false
verbose = false
Fill out the necessary variables, and either leave or override the optional configurations. Note that the section the publisher reads is the user
section, not the default nor example.
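As a quick sanity check before publishing, the [user] section can be inspected with Python's standard configparser module. The sketch below is an illustration only and is not part of esgcet: the helper name check_user_section is hypothetical, and it makes no claim about how the publisher itself parses the file. It treats [DEFAULT] as an ordinary section so that the placeholder text there is not inherited by [user].

```python
import configparser
import json
import os

# Keys marked "* necessary" in the default config file.
NECESSARY = ("data_node", "index_node", "cmor_path", "data_roots", "pid_creds")


def check_user_section(ini_text):
    """Return a list of problems found in the [user] section (hypothetical helper)."""
    # default_section is renamed so [DEFAULT] values are not inherited by [user];
    # interpolation is disabled so any '%' in a value is read literally.
    parser = configparser.ConfigParser(
        default_section="__unused__", interpolation=None
    )
    parser.read_string(ini_text)
    if not parser.has_section("user"):
        return ["missing [user] section"]

    user = parser["user"]
    problems = [f"{key} is not set" for key in NECESSARY
                if not user.get(key, "").strip()]

    # The default config states that cmor_path must be absolute.
    cmor_path = user.get("cmor_path", "").strip()
    if cmor_path and not os.path.isabs(cmor_path):
        problems.append("cmor_path is not an absolute path")

    # The default config states that data_roots must be a JSON-loadable dictionary.
    data_roots = user.get("data_roots", "").strip()
    if data_roots:
        try:
            if not isinstance(json.loads(data_roots), dict):
                problems.append("data_roots is not a JSON dictionary")
        except json.JSONDecodeError:
            problems.append("data_roots is not valid JSON")
    return problems
```

Passing the text of $HOME/.esg/esg.ini to this function returns an empty list when none of these particular checks fail. It does not verify that the nodes are reachable or that the PID credentials are valid.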
If you have an old config file from the previous iteration of the publisher, you can use esgmigrate
to migrate over those settings to a new config file which can be read by the current publisher.
See that page for more info.
Run Time Args¶
If you prefer to set certain things at runtime, the esgpublish
command has several optional command line arguments which will override options set in the config file.
For instance, if you use the --cmor-tables
command line argument to set the path to the cmor tables directory, that will override anything written in the config file under cmor_path
.
More details can be found in the esgpublish
section.
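The precedence rule described here (a command line argument wins over the config file) can be sketched in a few lines of Python. This is an illustrative model, not esgcet's actual code: the function effective_cmor_path and the plain dictionary standing in for the config are assumptions made for the example.

```python
import argparse


def effective_cmor_path(argv, config):
    """Pick cmor_path, preferring the command line over the config (sketch)."""
    parser = argparse.ArgumentParser()
    # Mirrors the documented --cmor-tables option; None means "not given".
    parser.add_argument("--cmor-tables", dest="cmor_path", default=None)
    # Other options (e.g. --map) are ignored by this sketch.
    args, _unknown = parser.parse_known_args(argv)
    if args.cmor_path is not None:
        return args.cmor_path
    return config.get("cmor_path")
```

With a config value of /export/user/cmor/Tables, calling the function with `["--cmor-tables", "/tmp/Tables"]` selects /tmp/Tables, while calling it without that option falls back to the config value.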
Autocurator¶
Install¶
If you do not wish to install autocurator via conda, the option also exists to clone and install it from git:
git clone http://github.com/sashakames/autocurator.git
cd autocurator
make
After running this, there should be an autocurator executable saved as .../autocurator/bin/autocurator
.
If you choose this route, you will need to update the config with the correct path to the autocurator folder, as the default is just the autocurator
command.
Running Autocurator¶
Before running autocurator
(if you are not using the conda installed version) you must first run the following command:
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib
This command helps autocurator locate and open shared libraries within the current conda environment. Autocurator will not work unless this command has been run first.
This also goes for running the esgpublish
command if, in your config, you have listed a direct path instead of simply the autocurator command.
If you want to run autocurator
standalone, use the following format:
bash autocurator.sh <path to autocurator executable> <full mapfile path> <scan file name (output file)>
The executable itself can also be run like so:
bin/autocurator --out_pretty --out_json <scan file name> --files <dataset directory>
However, this mode is sometimes difficult as specifying multiple files requires using a dir/*.nc
format which sometimes causes issues.
Overall, we recommend using the script above as it cleans up a few things. You can also use the conda install as above, but the path/command will just be “autocurator”.
Once you have your scan file, you can use that to run esgmkpubrec
(see that page for more info).
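If you drive a non-conda autocurator build from Python rather than from a shell, the LD_LIBRARY_PATH setting and the direct invocation shown above can be prepared as follows. This is a minimal sketch: the helper names autocurator_env and autocurator_command are hypothetical, and only the flags shown above are used.

```python
import os


def autocurator_env(conda_prefix, base_env=None):
    """Copy the environment and point LD_LIBRARY_PATH at <conda_prefix>/lib."""
    env = dict(os.environ if base_env is None else base_env)
    # Equivalent of: export LD_LIBRARY_PATH=$CONDA_PREFIX/lib
    env["LD_LIBRARY_PATH"] = os.path.join(conda_prefix, "lib")
    return env


def autocurator_command(executable, scan_file, files):
    """Build the argument list for the direct invocation of the executable."""
    return [executable, "--out_pretty", "--out_json", scan_file, "--files", files]
```

The two results can be handed to `subprocess.run(command, env=env, check=True)`. Note that when a command is passed as a list, no shell is involved, so a wildcard such as dir/*.nc is passed to the executable unexpanded.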
CMOR¶
Before running the publisher, you will also need to obtain a directory of CMOR tables, used by PrePARE to check the metadata of your files.
You can get this directory either using esgprep
or by cloning the git repository.
esgprep¶
You can install esgprep
using pip:
pip install esgprep
You can also clone their git repository and run setup.py:
git clone https://github.com/ESGF/esgf-prepare.git
cd esgf-prepare
python setup.py install
NOTE: esgprep
requires Python 2.6 or greater, but less than Python 3.0. Configure your virtual environment as needed.
Following install, simply run:
esgfetchtables
You can specify project using --project
and the output directory using --table-dir
like so:
esgfetchtables --project CMIP6 --table-dir <path>
Once you have fetched the tables, you can update the cmor_path
variable in your config file, or specify it at run time in the command line.
Clone Git Repository¶
Clone the repository:
git clone https://github.com/PCMDI/cmip6-cmor-tables.git
Your tables will be in the folder cmip6-cmor-tables/Tables
(unless you specify a different target directory name for the clone).
You can now update the cmor_path
variable in your config file, or specify it at run time in the command line.
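Before pointing cmor_path at a tables directory, it can help to confirm that the directory exists, is given as an absolute path, and actually contains table files. The sketch below assumes the tables are stored as .json files directly inside the directory, as they are in the cmip6-cmor-tables repository. The helper name list_cmor_tables is hypothetical.

```python
from pathlib import Path


def list_cmor_tables(cmor_path):
    """Return sorted names of the JSON table files directly inside cmor_path."""
    directory = Path(cmor_path)
    # The config requires cmor_path to be absolute, not relative.
    if not directory.is_absolute():
        raise ValueError("cmor_path must be an absolute path")
    if not directory.is_dir():
        raise FileNotFoundError(f"{directory} is not a directory")
    return sorted(p.name for p in directory.glob("*.json"))
```

An empty result suggests the path points at the wrong folder, for example at the repository root rather than its Tables subfolder.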
esgmigrate¶
The esgmigrate
command migrates config settings from the old publisher into a new config file formatted for the current publisher.
The output will be found in ~/.esg/esg.ini
which is the default config file path the publisher will read from.
Usage¶
esgmigrate
is used with the following syntax:
esgmigrate <ini_directory_path>
Where <ini_directory_path>
is an optional argument specifying the directory that contains an old esg.ini
file to migrate.
The default directory path is /esg/config/esgcet
.
esgpublish¶
The esgpublish
command publishes a record from start to finish using the mapfile(s) passed to it. On success, it will display a success message in the output of the last two steps.
If an error occurs, a helpful statement will be printed explaining which step went wrong and why.
Usage¶
esgpublish
is used with the following syntax:
esgpublish --map <mapfile>
You can also use --help
to see:
$ esgpublish --help
usage: esgpublish [-h] [--test] [--set-replica] [--no-replica] [--json JSON] [--data-node DATA_NODE] [--index-node INDEX_NODE] [--certificate CERT] [--project PROJ]
[--cmor-tables CMOR_PATH] [--autocurator AUTOCURATOR_PATH] --map MAP
Publish data sets to ESGF databases.
optional arguments:
-h, --help show this help message and exit
--test PID registration will run in 'test' mode. Use this mode unless you are performing 'production' publications.
--set-replica Enable replica publication.
--no-replica Disable replica publication.
--json JSON Load attributes from a JSON file in .json form. The attributes will override any found in the DRS structure or global attributes.
--data-node DATA_NODE
Specify data node.
--index-node INDEX_NODE
Specify index node.
--certificate CERT, -c CERT
Use the following certificate file in .pem form for publishing (use a myproxy login to generate).
--project PROJ Set/overide the project for the given mapfile, for use with selecting the DRS or specific features, e.g. PrePARE, PID.
--cmor-tables CMOR_PATH
Path to CMIP6 CMOR tables for PrePARE. Required for CMIP6 only.
--autocurator AUTOCURATOR_PATH
Path to autocurator repository folder.
--map MAP mapfile or file containing a list of mapfiles.
--ini CFG, -i CFG Path to config file.
This command can handle a single mapfile passed to it, or a file containing a list of mapfiles (with full paths).
If optional command line arguments are used, they will override anything set in the config file.
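A file listing mapfiles can be generated rather than written by hand. The sketch below collects every file ending in .map from one directory and writes the full paths to a list file. It assumes one path per line, which the text above does not specify, and the helper name write_mapfile_list is hypothetical.

```python
from pathlib import Path


def write_mapfile_list(mapfile_dir, list_path):
    """Write the absolute path of every *.map file in mapfile_dir, one per line."""
    # resolve() makes the directory absolute, so every listed path is a full path.
    mapfiles = sorted(Path(mapfile_dir).resolve().glob("*.map"))
    Path(list_path).write_text("".join(f"{m}\n" for m in mapfiles))
    return [str(m) for m in mapfiles]
```

The resulting list file can then be passed to esgpublish through --map in place of a single mapfile.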
NOTE: If, in your config file, you have specified a directory for autocurator
rather than the default command, i.e. you are using a different autocurator
than the one installed using conda, you must run the following command prior to running esgpublish
:
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib
If you do not run this and are not using the conda installed autocurator
, the program will not work.
esgmapconv¶
The esgmapconv
command executes the first step of the publishing protocol by converting metadata from a mapfile into json data.
That data is the input to the esgmkpubrec
command.
Usage¶
esgmapconv
is used with the following syntax:
esgmapconv --map <mapfile>
where <mapfile>
is the absolute path to a single mapfile. The output will be printed to stdout, but can be easily redirected to a chosen file using the --out-file
option.
You can also use the other command line options for additional configuration:
usage: esgmapconv [-h] [--project PROJ] --map MAP [--out-file OUT_FILE]
Publish data sets to ESGF databases.
optional arguments:
-h, --help show this help message and exit
--project PROJ Set/overide the project for the given mapfile, for use with selecting the DRS or specific features, e.g. PrePARE, PID.
--map MAP Mapfile ending in .map extension, contains metadata about the record.
--out-file OUT_FILE Output file for map data in JSON format. Default is printed to standard out.
Using the command line option -h
will display the above message.
The above options (excluding --map
) can be defined in the config file instead of the command line if you choose.
esgmkpubrec¶
The esgmkpubrec
command uses the output data from esgmapconv
to populate metadata for the dataset and file records.
This command also requires the output of the autocurator command, which populates additional metadata using the mapfile and puts it into a separate json file.
This output is the input to the esgpidcitepub
command.
Usage¶
esgmkpubrec
is used with the following syntax:
esgmkpubrec --scan-file <scan file> --map-data <JSON file>
where <JSON file>
is the aforementioned output from esgmapconv
and <scan file>
is the output of autocurator (https://github.com/lisi-w/autocurator).
The output again defaults to stdout, but can easily be redirected using the --out-file
option.
The other command line options are as follows:
usage: esgmkpubrec [-h] [--set-replica] [--no-replica] --scan-file SCAN_FILE [--json JSON] [--data-node DATA_NODE] [--index-node INDEX_NODE] --map-data MAP_DATA [--ini CFG]
[--out-file OUT_FILE]
Publish data sets to ESGF databases.
optional arguments:
-h, --help show this help message and exit
--set-replica Enable replica publication.
--no-replica Disable replica publication.
--scan-file SCAN_FILE
JSON output file from autocurator.
--json JSON Load attributes from a JSON file in .json form. The attributes will override any found in the DRS structure or global attributes.
--data-node DATA_NODE
Specify data node.
--index-node INDEX_NODE
Specify index node.
--map-data MAP_DATA Mapfile json data converted using esgmapconv.
--ini CFG, -i CFG Path to config file.
--out-file OUT_FILE Optional output file destination. Default is stdout.
esgpidcitepub¶
The esgpidcitepub
command connects to a PID server using credentials defined in the config file. It then assigns a PID to the dataset. This step is necessary for all CMIP6 data records.
The output of this command is the input to both the esgupdate
command as well as the esgindexpub
command.
Usage¶
esgpidcitepub
is used with the following syntax:
esgpidcitepub --pub-rec <JSON file>
where <JSON file>
is the output of the esgmkpubrec
command.
The output of this command is by default printed to stdout, but can easily be redirected using the --out-file
option.
The other command line options are as follows:
usage: esgpidcitepub [-h] [--data-node DATA_NODE] --pub-rec JSON_DATA [--ini CFG] [--out-file OUT_FILE]
Publish data sets to ESGF databases.
optional arguments:
-h, --help show this help message and exit
--data-node DATA_NODE
Specify data node.
--pub-rec JSON_DATA Dataset and file json data; output from esgmkpubrec.
--ini CFG, -i CFG Path to config file.
--out-file OUT_FILE Optional output file destination. Default is stdout.
You can also define the above options (aside from --pub-rec
) in the config file if you choose.
esgupdate¶
The esgupdate
command checks to see if the dataset being published is already in our database. If it is, it uses the metadata produced by the other commands to update the record.
The output is the published data along with a success message upon success.
Usage¶
esgupdate
is used with the following syntax:
esgupdate --pub-rec <JSON file>
where <JSON file>
is the output of the esgpidcitepub
command.
Additional command line options are as follows:
usage: esgupdate [-h] [--index-node INDEX_NODE] [--certificate CERT] --pub-rec JSON_DATA [--ini CFG]
Publish data sets to ESGF databases.
optional arguments:
-h, --help show this help message and exit
--index-node INDEX_NODE
Specify index node.
--certificate CERT, -c CERT
Use the following certificate file in .pem form for publishing (use a myproxy login to generate).
--pub-rec JSON_DATA JSON file output from esgpidcitepub or esgmkpubrec.
--ini CFG, -i CFG Path to config file.
You can also define most of these options in the config file if you choose.
esgindexpub¶
The esgindexpub
command publishes the data record using the metadata produced by the other commands to the index_node
defined in the config file.
The output of this command will display published data along with a success message upon success.
Usage¶
esgindexpub
is used with the following syntax:
esgindexpub --pub-rec <JSON file>
where <JSON file>
is the output of the esgpidcitepub
command.
You can also use the other command line options to configure some variables outside of the config file (or to define where to find the config file):
usage: esgindexpub [-h] [--index-node INDEX_NODE] [--certificate CERT] --pub-rec JSON_DATA [--ini CFG]
Publish data sets to ESGF databases.
optional arguments:
-h, --help show this help message and exit
--index-node INDEX_NODE
Specify index node.
--certificate CERT, -c CERT
Use the following certificate file in .pem form for publishing (use a myproxy login to generate).
--pub-rec JSON_DATA JSON file output from esgpidcitepub or esgmkpubrec.
--ini CFG, -i CFG Path to config file.
Use the command line option -h
to see the message above.
Esgcet is a package of publisher commands for publishing to the ESGF search database.
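The individual commands described above chain together, each consuming the output of an earlier step. The sketch below assembles that chain as argument lists, using only the options documented on this page. The function name pipeline_commands and the intermediate file names are hypothetical, and the sketch assumes the conda-installed autocurator command is on your PATH.

```python
import os


def pipeline_commands(mapfile, dataset_dir, workdir):
    """Return the step-by-step publishing commands as argument lists (a sketch)."""
    # Intermediate file names are hypothetical; any writable paths will do.
    map_json = os.path.join(workdir, "map.json")
    scan_json = os.path.join(workdir, "scan.json")
    pub_rec = os.path.join(workdir, "pubrec.json")
    pid_rec = os.path.join(workdir, "pidrec.json")
    return [
        # 1. Convert mapfile metadata to JSON.
        ["esgmapconv", "--map", mapfile, "--out-file", map_json],
        # 2. Scan the dataset files to produce the scan file.
        ["autocurator", "--out_pretty", "--out_json", scan_json,
         "--files", dataset_dir],
        # 3. Build the dataset and file records from both outputs.
        ["esgmkpubrec", "--scan-file", scan_json, "--map-data", map_json,
         "--out-file", pub_rec],
        # 4. Assign a PID to the dataset.
        ["esgpidcitepub", "--pub-rec", pub_rec, "--out-file", pid_rec],
        # 5. Update any existing record, then publish to the index node.
        ["esgupdate", "--pub-rec", pid_rec],
        ["esgindexpub", "--pub-rec", pid_rec],
    ]
```

Each list can be run in order with `subprocess.run(cmd, check=True)`, which stops at the first failing step. In most cases esgpublish performs the whole sequence for you, so running the steps individually is mainly useful for debugging a single stage.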