Rewrite almanac merge logic

[eccc_to_commons.git] / README
diff --git a/README b/README

index f6c6fd48d226008053938d38f17301ad7c608457..198387187540ad0681a28ba1da727ee6758e4aa6 100644
--- a/README
+++ b/README
@@ -12,8 +12,7 @@ distribution. In addition to coreutils, prerequisites are:
  - Xmlstarlet
  - Jq
  
-This repository is sponsored by Environment and Climate change Canada and
-Wikimedia Canada.
+This repository is sponsored by Wikimedia Canada.
  
  
  Provided scripts, ordered by chronological usage:
@@ -22,6 +21,7 @@ dllist.sh                 outputs a curl configuration file listing all availabl
  eccc_fixer.sh             fix upstream data XML files
  eccc_fixer.xslt           fix upstream data XML file
  commons_rules.xsd         validate ECCC XML from a Wikimedian point of view
+eccc_merger.sh            merge multiple ECCC XML files
  eccc_to_commons.sh        transform ECCC XML files into JSON
  monthly_to_commons.xslt   transform ECCC monthly XML file into JSON
  almanac_to_commons.xslt   transform ECCC almanac XML file into JSON
@@ -73,8 +73,8 @@ Keep only monthly data:
      -E '^output = ".*/monthly/[A-Z0-9]{7}.xml"$' > downloads_monthly
  
  Remove all downloads before (restart interrupted download):
-       $ sed -n '/https:\/\/climate.weather.gc.ca\/climate_data\/bulk_data_e.html?format=xml&timeframe=3&stationID=2606/,$p' \
-         downloads_all > download_continue
+  $ sed -n '/https:\/\/climate.weather.gc.ca\/climate_data\/bulk_data_e.html?format=xml&timeframe=3&stationID=2606/,$p' \
+    downloads_all > download_continue
  
  
  1.3 Download wanted files
@@ -131,11 +131,25 @@ Same as previously, the output should be empty. Otherwise, you must resolve
  every single problem before continuing.
  
  
+[OPTIONAL STEP] Merge multiple XML files
+Sometimes, per-station granularity is finer than you need. To merge
+two or more XML files, use the eccc_merger.sh script:
+
+  $ ./eccc_merger.sh "${ECCC_CACHE}/almanac/3050519.xml" \
+    "${ECCC_CACHE}/almanac/3050520.xml" "${ECCC_CACHE}/almanac/3050521.xml" \
+    "${ECCC_CACHE}/almanac/3050522.xml" "${ECCC_CACHE}/almanac/3050526.xml" \
+    > banff.xml
+
+To get station IDs based on their geographical position, you can use
+the eccc_map tool. A public instance is hosted online at
+https://stations.wikimedia.ca/ .
+
+
  4. Transform data into target format
  Here we are, here is the fun part: let's create weather data in Wikimedia
  Commons format.
  
-  $ ./eccc_to_commons "${ECCC_CACHE}" "${COMMONS_CACHE}" 2>log
+  $ ./eccc_to_commons.sh "${ECCC_CACHE}" "${COMMONS_CACHE}" 2>log
  
  It will replicate the future Commons content paths inside nested directories.
  So, for example future