eccc_to_commons - Set of tools to replicate Environment and Climate Change Canada data on Wikimedia Commons

This is a collection of scripts (mainly Bash and XSD/XSLT). Most of them use standard Unix/GNU tools, so they should work on any recent GNU/Linux distribution. In addition to coreutils, the prerequisites are:

- Bash 4+
- Curl
- Xmlstarlet
- Jq

This repository is sponsored by Wikimedia Canada.

Provided scripts, ordered by chronological usage:

- dllist.sh: outputs a curl configuration file listing all available ECCC data
- eccc_fixer.sh: fixes upstream data XML files
- eccc_fixer.xslt: fixes a single upstream data XML file
- commons_rules.xsd: validates ECCC XML from a Wikimedian point of view
- eccc_merger.sh: merges multiple ECCC XML files
- eccc_to_commons.sh: transforms ECCC XML files into JSON
- monthly_to_commons.xslt: transforms an ECCC monthly XML file into JSON
- almanac_to_commons.xslt: transforms an ECCC almanac XML file into JSON
- mediawiki_post.sh: uploads a directory to a MediaWiki instance

Usage:

The general idea of a large scale data import process is:

1. Download a copy of all data required for the import
2. Validate the cache
3. Transform the data into the target format
4. Upload to the destination

These tools require some technical knowledge, so they aren't meant for general use. They are however a good starting point to discover how large imports are processed in community driven projects.

In practice:

Besides a reasonable amount of available disk space, you will have to create two distinct folders: the first will contain a copy of the downloaded ECCC data, while the second will contain the data to be uploaded to Wikimedia Commons. The following sections refer to them as ${ECCC_CACHE} and ${COMMONS_CACHE}. These environment variables must be set, or replaced by valid paths, when the commands are used.

1. Download a copy of all data required for the import

1.1 Create a list of all ECCC provided files

First, generate a list of all the historical data provided by ECCC:

$ ./dllist.sh "${ECCC_CACHE}" > "downloads_all"

Expect a fairly long runtime. As of January 2020, it generates a list with almost 650,000 download links.

1.2 Filter unwanted files

This long list may contain more files than you actually need, so you may want to trim it down to avoid downloading and storing useless content. This step depends entirely on your own needs, so not every case is covered here. downloads_all is a regular text file, so you can edit it with tools like sed, grep or your preferred interactive editor. Here are a few examples to inspire you.

Keep only monthly data:

$ cat downloads_all | grep -B1 -A1 --no-group-separator \
    -E '^output = ".*/monthly/[A-Z0-9]{7}.xml"$' > downloads_monthly

Remove every download before a given entry (to restart an interrupted download):

$ sed -n '/https:\/\/climate.weather.gc.ca\/climate_data\/bulk_data_e.html?format=xml&timeframe=3&stationID=2606/,$p' \
    downloads_all > download_continue

1.3 Download wanted files

Give your own list of downloads to curl, adding any parameters you need:

$ curl --fail-early --create-dirs -K downloads_all
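Once curl has finished, you may want to check that nothing is missing. The following is only an illustrative sanity check, not something the scripts require: it relies on the 'output = "..."' lines present in the curl configuration file and assumes you kept only XML entries.

$ grep -c '^output = ' downloads_all                   # entries in the download list
$ find "${ECCC_CACHE}" -type f -name '*.xml' | wc -l   # XML files actually on disk

If the two numbers differ, re-run curl on the same list, or on a trimmed "continue" list as shown in step 1.2, until they match.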
2. Fix the files

Be aware that the files you've downloaded are buggy. Yes, all of them; they are distributed that way by ECCC: they reference the XML Schema instance namespace with a wrong prefix and URI. Fortunately, there is a simple fix.

The clean way to perform the fix is to use the following script:

$ ./eccc_fixer.sh "${ECCC_CACHE}" "${ECCC_CACHE}-fixed"

However, if you don't want to keep the original files, you can simply run:

$ find "${ECCC_CACHE}" -type f -name '*.xml' -exec sed -i -e \
    's/xsd:schemaLocation/xsi:schemaLocation/;s/xmlns:xsd="http:\/\/www.w3.org\/TR\/xmlschema-1\/"/xmlns:xsi="http:\/\/www.w3.org\/2001\/XMLSchema-instance"/' {} +

From now on, this guide expects "${ECCC_CACHE}" to point to the directory containing the fixed files only.

3. Validate the cache

It's important to make the effort to validate the files before processing them. Every transformation makes assumptions about data structure and content that can only be asserted with proper validation schemas. Skipping this step may lead to broken transformations or invalid final data.

This step is split in two: first we check that the data is valid from an ECCC point of view, then we check that it is valid through Wikimedian eyes.

3.1 Validate the data according to ECCC standards

The XML schema distributed by ECCC is itself incorrect: it won't validate any XML coming from them. A fixed version can be found in the Wikimedia Canada Git repositories.

$ git clone https://git.wikimedia.ca/eccc_schema.git
$ find "${ECCC_CACHE}" -type f -name '*.xml' -exec xmlstarlet val -b \
    -s eccc_schema/bulkschema.xsd {} \;

The second command lists all incorrect files. If the output is empty, you can continue.

3.2 Validate the data according to Wikimedia standards

$ find "${ECCC_CACHE}" -type f -name '*.xml' -exec xmlstarlet val -b \
    -s commons_rules.xsd {} \;

As previously, the output should be empty. Otherwise, you must resolve every single problem before continuing.

[OPTIONAL STEP] Merge multiple XML files

Sometimes, per-station granularity is finer than you need. If you want to merge two or more XML files, you can use the eccc_merger.sh script:

$ ./eccc_merger.sh "${ECCC_CACHE}/almanac/3050519.xml" \
    "${ECCC_CACHE}/almanac/3050520.xml" "${ECCC_CACHE}/almanac/3050521.xml" \
    "${ECCC_CACHE}/almanac/3050522.xml" "${ECCC_CACHE}/almanac/3050526.xml" \
    > banff.xml

To find station IDs based on their geographical position, you can use the eccc_map tool. A public instance is hosted online at https://stations.wikimedia.ca/ .

4. Transform data into the target format

Here we are, the fun part: let's create weather data in Wikimedia Commons format.

$ ./eccc_to_commons.sh "${ECCC_CACHE}" "${COMMONS_CACHE}" 2>log

It replicates the future Commons content paths as nested directories. For example, the future https://commons.wikimedia.org/wiki/Data:weather.gc.ca/Monthly/4271.tab resource will be created as ${COMMONS_CACHE}/weather.gc.ca/Monthly/4271.tab. A summary log file is created for further reference on what was done during the conversion.

5. Upload to the destination

It's now time to share our work with the world; that's the purpose of the mediawiki_post.sh script.

$ ./mediawiki_post.sh "${COMMONS_CACHE}"

It takes the Commons cache as its only parameter: the file hierarchy it contains will be replicated on Commons. On first run, it asks for the credentials of the MediaWiki account used to perform the import.
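A final tip: before launching the upload, you may want to make sure every generated file is well-formed JSON. This is only an illustrative check using jq (already listed as a prerequisite); the '*.tab' pattern assumes the layout shown in step 4:

$ find "${COMMONS_CACHE}" -type f -name '*.tab' \
    -exec sh -c 'jq empty "$1" 2>/dev/null || echo "Invalid JSON: $1"' _ {} \;

Any path printed by this command points to a file that jq could not parse and that should be regenerated before running mediawiki_post.sh.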