Rewrite almanac merge logic

diff --git a/README b/README

index 9c0fe34c404fb0b2335bc3b05c196c36517adc2b..198387187540ad0681a28ba1da727ee6758e4aa6 100644
--- a/README
+++ b/README
@@ -12,8 +12,7 @@ distribution. In addition to coreutils, prerequisites are:
  - Xmlstarlet
  - Jq
  
-This repository is sponsored by Environment and Climate change Canada and
-Wikimedia Canada.
+This repository is sponsored by Wikimedia Canada.
  
  
  Provided scripts, ordered by chronological usage:
@@ -22,8 +21,11 @@ dllist.sh                 outputs a curl configuration file listing all availabl
  eccc_fixer.sh             fix upstream data XML files
  eccc_fixer.xslt           fix upstream data XML file
  commons_rules.xsd         validate ECCC XML from a Wikimedian point of view
+eccc_merger.sh            merge multiple ECCC XML files
  eccc_to_commons.sh        transform ECCC XML files into JSON
  monthly_to_commons.xslt   transform ECCC monthly XML file into JSON
+almanac_to_commons.xslt   transform ECCC almanac XML file into JSON
+mediawiki_post.sh         upload directory to a MediaWiki instance
  
  
  Usage:
@@ -68,11 +70,11 @@ Here are a few examples to inspire you:
  
  Keep only monthly data:
    $ cat downloads_all | grep -B1 -A1 --no-group-separator \
-    -E '^output = ".*/monthly/[0-9]*.xml"$' > downloads_monthly
+    -E '^output = ".*/monthly/[A-Z0-9]{7}.xml"$' > downloads_monthly
  
  Remove all downloads before (restart interrupted download):
-       $ sed -n '/https:\/\/climate.weather.gc.ca\/climate_data\/bulk_data_e.html?format=xml&timeframe=3&stationID=2606/,$p' \
-         downloads_all > download_continue
+  $ sed -n '/https:\/\/climate.weather.gc.ca\/climate_data\/bulk_data_e.html?format=xml&timeframe=3&stationID=2606/,$p' \
+    downloads_all > download_continue
  
  
  1.3 Download wanted files
@@ -129,11 +131,25 @@ Same as previously, the output should be empty. Otherwise, you must resolve
  every single problem before continuing.
  
  
+[OPTIONAL STEP] Merge multiple XML files
+Sometimes per-station granularity is too fine. If you need to merge two or
+more XML files, you can use the eccc_merger.sh script:
+
+  $ ./eccc_merger.sh "${ECCC_CACHE}/almanac/3050519.xml" \
+    "${ECCC_CACHE}/almanac/3050520.xml" "${ECCC_CACHE}/almanac/3050521.xml" \
+    "${ECCC_CACHE}/almanac/3050522.xml" "${ECCC_CACHE}/almanac/3050526.xml" \
+    > banff.xml
+
+To find station IDs based on their geographical position, you can use the
+eccc_map tool. A public instance is hosted online at
+https://stations.wikimedia.ca/ .
+
+
  4. Transform data into target format
  Here we are, here is the fun part: let's create weather data in Wikimedia
  Commons format.
  
-  $ ./eccc_to_commons "${ECCC_CACHE}" "${COMMONS_CACHE}" 2>log
+  $ ./eccc_to_commons.sh "${ECCC_CACHE}" "${COMMONS_CACHE}" 2>log
  
  It will replicate the future Commons content paths inside nested directories.
  So, for example future
@@ -144,4 +160,11 @@ conversion.
  
  
  5. Upload to destination
-Not done yet.
+It's now time to share our work with the world and that's the purpose of the
+mediawiki_post.sh script.
+
+  $ ./mediawiki_post.sh "${COMMONS_CACHE}"
+
+It takes the Commons cache as its parameter: its file hierarchy will be
+replicated on Commons. On first run, it will ask for the credentials of the
+MediaWiki account to use for the import.
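
Note on the "Remove all downloads before" example: the sed range expression
needs every slash of the URL escaped. A minimal sketch of an alternative is
given here, assuming only that the download list is a plain text file that
contains the URL somewhere on one line. The function name resume_from is
hypothetical and not part of the repository. It swaps sed for awk's index(),
which compares literal text, so the URL can be passed unescaped.

```shell
#!/bin/sh
# Hypothetical helper (not part of eccc_to_commons): print a download list
# starting at the first line that contains a given literal string, through
# to the end of the file.
#   $1: literal text to look for, e.g. the URL of the interrupted download
#   $2: path to the download list
resume_from() {
    # index() does a plain substring search, so '/', '?' and '&' in the URL
    # need no escaping. Once a line matches, 'found' stays set and every
    # line from that point on is printed.
    awk -v needle="$1" 'index($0, needle) { found = 1 } found' "$2"
}
```

It could be used like this, keeping the file names from the README:

  $ resume_from 'timeframe=3&stationID=2606' downloads_all > download_continue

Because the match is a substring search, a short needle such as
'stationID=26' would also match station 2606; passing the full URL avoids
that. A needle containing backslashes would be altered by awk -v, which is
not a concern for the URLs shown here.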