Complete process from ECCC website to Commons files generation

diff --git a/README b/README
new file mode 100644
index 0000000..6db096e
--- /dev/null
+++ b/README
@@ -0,0 +1,146 @@
+eccc_to_commons - Set of tools to replicate
+                  Environment and Climate Change Canada data on
+                  Wikimedia Commons
+
+This is a collection of scripts (mainly Bash and XSD/XSLT).
+
+Most of them use standard Unix/GNU tools, so they should work on any recent GNU
+distribution. In addition to coreutils, prerequisites are:
+
+- Bash 4+
+- Curl
+- Xmlstarlet
+
+This repository is sponsored by Environment and Climate Change Canada and
+Wikimedia Canada.
+
+
+Provided scripts, ordered by chronological usage:
+dllist.sh                 outputs a curl configuration file listing all available
+                          ECCC data
+eccc_fixer.sh             fix upstream data XML files
+eccc_fixer.xslt           fix upstream data XML file
+commons_rules.xsd         validate ECCC XML from a Wikimedian point of view
+eccc_to_commons.sh        transform ECCC XML files into JSON
+monthly_to_commons.xslt   transform ECCC monthly XML file into JSON
+
+
+Usage:
+The general idea of a large scale data import process is:
+ 1. Download a copy of all data required for the import
+ 2. Validate the cache
+ 3. Transform data into the target format
+ 4. Upload to the destination
+
+These tools require some technical knowledge prior to using them so they aren't
+for general use. This is however a good starting point to discover how large
+imports are processed in community driven projects.
+
+
+In practice:
+Besides a reasonable amount of available disk space, you will need two
+distinct folders: the first will contain a copy of the data downloaded from
+ECCC, while the second will contain the data to be uploaded to Wikimedia
+Commons.
+The following sections refer to them as ${ECCC_CACHE} and ${COMMONS_CACHE}.
+These environment variables must be set, or replaced by valid paths, when the
+commands are used.
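A minimal sketch of this setup, assuming you want both folders under one base directory. The function name `init_caches` and the sub-folder names `eccc_cache` and `commons_cache` are hypothetical; nothing in the repository requires them.

```shell
#!/usr/bin/env bash
# Sketch only: init_caches and the folder names below are illustrative
# assumptions, not something provided by eccc_to_commons.

# Create the two working folders under a base directory and export
# ECCC_CACHE and COMMONS_CACHE, keeping any value that is already set.
init_caches() {
  local base="$1"
  export ECCC_CACHE="${ECCC_CACHE:-${base}/eccc_cache}"
  export COMMONS_CACHE="${COMMONS_CACHE:-${base}/commons_cache}"

  # The guide asks for two distinct folders.
  if [ "${ECCC_CACHE}" = "${COMMONS_CACHE}" ]; then
    echo "ECCC_CACHE and COMMONS_CACHE must differ" >&2
    return 1
  fi

  mkdir -p "${ECCC_CACHE}" "${COMMONS_CACHE}"
}
```

For example, `init_caches "${HOME}/eccc_import" && df -h "${ECCC_CACHE}"` would create the folders and show the disk space left on that file system.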
+
+
+1.  Download a copy of all data required for the import
+1.1 Create a list of all ECCC provided files
+First, we generate a list of all the historical data provided by ECCC.
+
+ $ ./dllist.sh "${ECCC_CACHE}" > "downloads_all"
+
+Expect a fairly long runtime. As of January 2020, it generates a list with
+almost 650,000 download links.
+
+
+1.2 Filter unwanted files
+This long list may contain more files than you actually need, so you may want to
+reduce it so you don't have to download/store useless content.
+This step basically depends on your own needs, so not all cases will be covered
+here. downloads_all is a regular text file, so you can edit it with regular
+tools like sed, grep or your preferred interactive editor.
+
+Here are a few examples to inspire you:
+
+Keep only monthly data:
+  $ grep -B1 -A1 --no-group-separator \
+    -E '^output = ".*/monthly/[0-9]*\.xml"$' downloads_all > downloads_monthly
+
+Drop every download listed before a given URL (to resume an interrupted
+download):
+  $ sed -n '/https:\/\/climate.weather.gc.ca\/climate_data\/bulk_data_e.html?format=xml&timeframe=3&stationID=2606/,$p' \
+    downloads_all > download_continue
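Before filtering, it may help to see how the list breaks down. The sketch below counts entries per timeframe by reading the `output = "..."` lines that the grep example relies on. It assumes each download has exactly one such line and that the timeframe is the folder just above the file name, as in the `.*/monthly/[0-9]*.xml` pattern; `count_by_timeframe` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: count_by_timeframe is an illustrative helper, not a
# script shipped with eccc_to_commons.

# Print "<folder> <count>" for each parent folder found in the
# output = "..." lines of a curl configuration file.
count_by_timeframe() {
  awk -F'"' '
    /^output = "/ {
      n = split($2, parts, "/")            # path components of the target
      if (n >= 2) count[parts[n - 1]]++    # folder right above the file
    }
    END { for (t in count) print t, count[t] }
  ' "$1" | sort
}
```

Running `count_by_timeframe downloads_all` prints one line per folder name, which gives a rough idea of how much each filter would keep.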
+
+
+1.3 Download wanted files
+Give your list of downloads to curl, adding any parameters you need. Replace
+downloads_all with your filtered list if you created one.
+
+  $ curl --fail-early --create-dirs -K downloads_all
+
+
+2.  Fix the files
+Be aware that the files you've downloaded are buggy. Yes, all of them: they're
+distributed as is by ECCC. Fortunately, there is a simple fix.
+
+The clean way to perform the fix is to use the following script:
+
+  $ ./eccc_fixer.sh "${ECCC_CACHE}" "${ECCC_CACHE}-fixed"
+
+
+However, if you don't want to keep the original files, you can just do:
+
+  $ find "${ECCC_CACHE}" -type f -name '*.xml' -exec sed -i -e \
+    's/xsd:schemaLocation/xsi:schemaLocation/;s/xmlns:xsd="http:\/\/www.w3.org\/TR\/xmlschema-1\/"/xmlns:xsi="http:\/\/www.w3.org\/2001\/XMLSchema-instance"/' {} +
+
+From now on, the guide expects "${ECCC_CACHE}" to point to the directory
+containing the fixed files only.
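To check that the fix reached every file, one option is to list the files that still contain the original `xsd:schemaLocation` attribute, which the sed expression rewrites. This is only a sketch: `list_unfixed` is a hypothetical helper name, and an empty result merely suggests that nothing was missed.

```shell
#!/usr/bin/env bash
# Sketch only: list_unfixed is an illustrative helper, not part of the
# repository.

# Print every XML file under the given directory that still contains
# the unfixed "xsd:schemaLocation" attribute.
list_unfixed() {
  find "$1" -type f -name '*.xml' -exec grep -l 'xsd:schemaLocation' {} +
}

# Only run the check when the cache location is known.
if [ -n "${ECCC_CACHE:-}" ] && [ -d "${ECCC_CACHE}" ]; then
  if [ -n "$(list_unfixed "${ECCC_CACHE}")" ]; then
    echo "Some files still need fixing:" >&2
    list_unfixed "${ECCC_CACHE}" >&2
  fi
fi
```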
+
+
+3.  Validate the cache
+It's important that you make the effort to validate the files before processing
+them. Every transformation makes assumptions about data structure and content
+that can only be asserted with proper validation schemas. Bypassing this step
+may lead to failing transformations or invalid final data.
+This step is split in two: first we check the data is valid from an ECCC point
+of view, then we check it's valid through Wikimedian eyes.
+
+3.1 Validate the data according to ECCC standards
+The XML schema distributed by ECCC is itself incorrect: it won't validate any
+XML coming from them. A fixed version can be found in the Wikimedia Canada Git
+repositories.
+
+  $ git clone https://git.wikimedia.ca/eccc_schema.git
+  $ find "${ECCC_CACHE}" -type f -name '*.xml' -exec xmlstarlet val -b \
+    -s eccc_schema/bulkschema.xsd {} \;
+
+The second command will list all incorrect files. If output is empty, you can
+continue.
+
+
+3.2 Validate the data according to Wikimedia standards
+
+  $ find "${ECCC_CACHE}" -type f -name '*.xml' -exec xmlstarlet val -b \
+    -s commons_rules.xsd {} \;
+
+Same as previously, the output should be empty. Otherwise, you must resolve
+every single problem before continuing.
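Both validation commands report problems only through their output, so a script that chains the steps needs to turn a non-empty listing into a failure. A minimal sketch, where `require_empty_output` is a hypothetical helper name and the exit status of the wrapped command is deliberately ignored:

```shell
#!/usr/bin/env bash
# Sketch only: require_empty_output is an illustrative helper, not part
# of the repository.

# Run the given command; if it prints anything on stdout, show that
# output and return 1 so the caller can stop before the next step.
# The exit status of the command itself is not taken into account.
require_empty_output() {
  local output
  output="$("$@")"
  if [ -n "${output}" ]; then
    printf '%s\n' "${output}"
    return 1
  fi
  return 0
}

# Example: stop unless every file passes the Wikimedia rules.
if [ -n "${ECCC_CACHE:-}" ] && command -v xmlstarlet >/dev/null 2>&1; then
  require_empty_output find "${ECCC_CACHE}" -type f -name '*.xml' \
    -exec xmlstarlet val -b -s commons_rules.xsd {} \; \
    || echo "Validation failed, fix the files listed above" >&2
fi
```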
+
+
+4.  Transform data into the target format
+Now for the fun part: let's create weather data in Wikimedia Commons format.
+
+  $ ./eccc_to_commons.sh "${ECCC_CACHE}" "${COMMONS_CACHE}" 2>log
+
+It replicates the future Commons content paths inside nested directories.
+For example, the future
+https://commons.wikimedia.org/wiki/Data:weather.gc.ca/Monthly/4271.tab resource
+will be created in ${COMMONS_CACHE}/weather.gc.ca/Monthly/4271.tab.
+A summary log file is written to "log" for later reference on what was done
+during the conversion.
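The mapping between a future Commons page and its local file can be sketched as a simple prefix swap. This assumes every page lives under `https://commons.wikimedia.org/wiki/Data:`, as in the example; `commons_url_to_path` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: commons_url_to_path is an illustrative helper, not part
# of the repository.

# Translate a Commons "Data:" page URL into the matching file path
# inside the given cache directory.
commons_url_to_path() {
  local url="$1" cache="$2"
  local prefix='https://commons.wikimedia.org/wiki/Data:'
  case "${url}" in
    "${prefix}"*) printf '%s/%s\n' "${cache}" "${url#"${prefix}"}" ;;
    *) echo "not a Commons Data: URL: ${url}" >&2; return 1 ;;
  esac
}
```

Called with the URL from the example and `"${COMMONS_CACHE}"`, it prints the `weather.gc.ca/Monthly/4271.tab` path inside that folder, which is handy for checking that an expected file was generated.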
+
+
+5.  Upload to the destination
+Not done yet.