eccc_to_commons - Set of tools to replicate
Environment and Climate Change Canada data on
Wikimedia Commons

This is a collection of scripts (mainly Bash and XSD/XSLT).

Most of them use standard Unix/GNU tools, so they should work on any recent
GNU distribution. In addition to coreutils, the prerequisites are:

- Bash 4+
- Curl
- Xmlstarlet
- Jq
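
You can quickly check that all of them are available with a one-liner like
this (a minimal sketch; it checks availability, not versions):

$ for t in bash curl xmlstarlet jq; do command -v "$t" || echo "$t missing"; done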

This repository is sponsored by Wikimedia Canada.


Provided scripts, ordered by chronological usage:
dllist.sh                outputs a curl configuration file listing all
                         available ECCC data
eccc_fixer.sh            fixes upstream XML data files
eccc_fixer.xslt          fixes a single upstream XML data file
commons_rules.xsd        validates ECCC XML from a Wikimedian point of view
eccc_merger.sh           merges multiple ECCC XML files
eccc_to_commons.sh       transforms ECCC XML files into JSON
monthly_to_commons.xslt  transforms an ECCC monthly XML file into JSON
almanac_to_commons.xslt  transforms an ECCC almanac XML file into JSON
mediawiki_post.sh        uploads a directory to a MediaWiki


Usage:
The general idea of a large scale data import process is:
1. Download a copy of all data required for the import
2. Fix the upstream files
3. Validate the cache
4. Transform the data into the target format
5. Upload to the destination

These tools require some technical knowledge prior to using them, so they
aren't meant for general use. They are however a good starting point to
discover how large imports are processed in community driven projects.


In practice:
Besides a reasonable amount of available disk space, you will have to create
two distinct folders: the first will contain a copy of the downloaded ECCC
data, while the second will contain the data to be uploaded to Wikimedia
Commons.
The following sections will refer to them as ${ECCC_CACHE} and
${COMMONS_CACHE}. These environment variables must be set or replaced by
valid paths when the commands are used.
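
For example (the paths are placeholders, pick whatever suits your setup):

$ export ECCC_CACHE="${HOME}/eccc_cache"
$ export COMMONS_CACHE="${HOME}/commons_cache"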


1. Download a copy of all data required for the import
1.1 Create a list of all ECCC provided files
First, we generate a list of all the historical data provided by ECCC.

$ ./dllist.sh "${ECCC_CACHE}" > "downloads_all"

Expect a rather long runtime: as of January 2020, it generates a list with
almost 650,000 download links.
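
The generated file is a curl configuration holding one output = "..." line
per download, so you can count the entries with something like this (a
sketch, assuming that format):

$ grep -c '^output = ' downloads_all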


1.2 Filter unwanted files
This long list may contain more files than you actually need, so you may want
to reduce it so you don't have to download/store useless content.
This step depends entirely on your own needs, so not all cases will be
covered here. downloads_all is a regular text file, so you can edit it with
the usual tools like sed, grep or your preferred interactive editor.

Here are a few examples to inspire you:

Keep only monthly data:
$ grep -B1 -A1 --no-group-separator \
-E '^output = ".*/monthly/[A-Z0-9]{7}.xml"$' downloads_all > downloads_monthly

Skip every entry before a given URL (useful to restart an interrupted
download):
$ sed -n '/https:\/\/climate.weather.gc.ca\/climate_data\/bulk_data_e.html?format=xml&timeframe=3&stationID=2606/,$p' \
downloads_all > download_continue
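
Similarly, to keep only almanac data (a sketch, assuming the almanac files
follow the same naming convention as the monthly ones):
$ grep -B1 -A1 --no-group-separator \
-E '^output = ".*/almanac/[A-Z0-9]{7}.xml"$' downloads_all > downloads_almanac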


1.3 Download wanted files
Give your own list of downloads to curl. You can add any parameters you need.

$ curl --fail-early --create-dirs -K downloads_all
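
For instance, to be more resilient to transient network failures, you might
add curl's retry options (a sketch, tune the values to your needs):

$ curl --fail-early --create-dirs --retry 5 --retry-delay 10 -K downloads_all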


2. Fix the files
Be aware that the files you've downloaded are buggy. Yes, all of them: that's
how ECCC distributes them. But wait, there is a simple fix.

The clean way to perform the fix is to use the following script:

$ ./eccc_fixer.sh "${ECCC_CACHE}" "${ECCC_CACHE}-fixed"

However, if you don't want to keep the original files, you can just do:

$ find "${ECCC_CACHE}" -type f -name '*.xml' -exec sed -i -e \
's/xsd:schemaLocation/xsi:schemaLocation/;s/xmlns:xsd="http:\/\/www.w3.org\/TR\/xmlschema-1\/"/xmlns:xsi="http:\/\/www.w3.org\/2001\/XMLSchema-instance"/' {} +

From now on, the guide expects "${ECCC_CACHE}" to point to the directory
containing fixed files only.
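
A quick sanity check (a sketch: it merely greps for the buggy attribute, so
an empty output means every file was fixed):

$ grep -rl 'xsd:schemaLocation' "${ECCC_CACHE}"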


3. Validate the cache
It's important to make the effort to validate the files before processing
them. Every transformation makes assumptions on data structure/content that
can only be asserted by using proper validation schemes. Bypassing this step
may lead to broken transformations or invalid final data.
This step is split in two: first we check the data is valid from an ECCC
point of view, then we check it's valid through Wikimedian eyes.

3.1 Validate the data according to ECCC standards
Unfortunately, the XML schema distributed by ECCC is incorrect: it won't
validate any XML coming from them. A fixed version can be found on the
Wikimedia Canada Git repositories.

$ git clone https://git.wikimedia.ca/eccc_schema.git
$ find "${ECCC_CACHE}" -type f -name '*.xml' -exec xmlstarlet val -b \
-s eccc_schema/bulkschema.xsd {} \;

The second command lists all invalid files. If the output is empty, you can
continue.
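
If a file is reported, you can ask xmlstarlet for detailed error messages (a
sketch; the file path is only an example):

$ xmlstarlet val -e -s eccc_schema/bulkschema.xsd \
"${ECCC_CACHE}/almanac/3050519.xml"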


3.2 Validate the data according to Wikimedia standards

$ find "${ECCC_CACHE}" -type f -name '*.xml' -exec xmlstarlet val -b \
-s commons_rules.xsd {} \;

Same as previously, the output should be empty. Otherwise, you must resolve
every single problem before continuing.


[OPTIONAL STEP] Merge multiple XML files
Sometimes, per-station granularity is finer than you need. If you want to
merge two or more XML files, you can use the eccc_merger.sh script:

$ ./eccc_merger.sh "${ECCC_CACHE}/almanac/3050519.xml" \
"${ECCC_CACHE}/almanac/3050520.xml" "${ECCC_CACHE}/almanac/3050521.xml" \
"${ECCC_CACHE}/almanac/3050522.xml" "${ECCC_CACHE}/almanac/3050526.xml" \
> banff.xml
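
When the station IDs share a common prefix, a shell glob can save some typing
(a sketch: double-check the expansion matches exactly the files you want to
merge):

$ ./eccc_merger.sh "${ECCC_CACHE}"/almanac/305052*.xml > banff.xml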

To get station IDs based on their geographical position, you can use the
eccc_map tool. A public instance is hosted online at
https://stations.wikimedia.ca/ .


4. Transform the data into the target format
Here we are, this is the fun part: let's create weather data in Wikimedia
Commons format.

$ ./eccc_to_commons.sh "${ECCC_CACHE}" "${COMMONS_CACHE}" 2>log

It replicates the future Commons content paths inside nested directories. For
example, the future
https://commons.wikimedia.org/wiki/Data:weather.gc.ca/Monthly/4271.tab
resource will be created as ${COMMONS_CACHE}/weather.gc.ca/Monthly/4271.tab.
A summary log file is created for further reference on what has been done
during the conversion.
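
Since jq is already a prerequisite, you can check that every generated file
parses as valid JSON (a sketch: jq's empty filter prints nothing, so any
message points at a broken file):

$ find "${COMMONS_CACHE}" -type f -exec jq empty {} \;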


5. Upload to the destination
It's now time to share our work with the world, and that's the purpose of the
mediawiki_post.sh script.

$ ./mediawiki_post.sh "${COMMONS_CACHE}"

It takes the Commons cache as its parameter: the file hierarchy will be
replicated on Commons. On first run, it will ask for the credentials of the
MediaWiki account used to perform the import.