X-Git-Url: https://git.wikimedia.ca/?p=eccc_to_commons.git;a=blobdiff_plain;f=README;h=198387187540ad0681a28ba1da727ee6758e4aa6;hp=ee13ba54500af8426a97fcf5c0d9619eaa7e8899;hb=HEAD;hpb=c4fdff03a380cd239b55e8d5f57817b967d1449d

diff --git a/README b/README
index ee13ba5..1983871 100644
--- a/README
+++ b/README
@@ -12,8 +12,7 @@ distribution. In addition to coreutils, prerequisites are:
 - Xmlstarlet
 - Jq
 
-This repository is sponsored by Environment and Climate change Canada and
-Wikimedia Canada.
+This repository is sponsored by Wikimedia Canada.
 
 Provided scripts, ordered by chronological usage:
 
@@ -22,9 +21,11 @@ dllist.sh                outputs a curl configuration file listing all availabl
 eccc_fixer.sh            fix upstream data XML files
 eccc_fixer.xslt          fix upstream data XML file
 commons_rules.xsd        validate ECCC XML from a Wikimedian point of view
+eccc_merger.sh           merge multiple ECCC XML files
 eccc_to_commons.sh       transform ECCC XML files into JSON
 monthly_to_commons.xslt  transform ECCC monthly XML file into JSON
 almanac_to_commons.xslt  transform ECCC almanac XML file into JSON
+mediawiki_post.sh        upload a directory to a MediaWiki instance
 
 Usage:
 
@@ -72,8 +73,8 @@ Keep only monthly data:
       -E '^output = ".*/monthly/[A-Z0-9]{7}.xml"$' > downloads_monthly
 
 Remove all downloads before (restart interrupted download):
-    $ sed -n '/https:\/\/climate.weather.gc.ca\/climate_data\/bulk_data_e.html?format=xml&timeframe=3&stationID=2606/,$p' \
-      downloads_all > download_continue
+    $ sed -n '/https:\/\/climate.weather.gc.ca\/climate_data\/bulk_data_e.html?format=xml&timeframe=3&stationID=2606/,$p' \
+      downloads_all > download_continue
 
 1.3 Download wanted files
 
@@ -130,11 +131,25 @@ Same as previously, the output should be empty. Otherwise, you must resolve
 every single problem before continuing.
 
 
+[OPTIONAL STEP] Merge multiple XML files
+Sometimes, per-station granularity is finer than needed. 
If you need to merge
+two or more XML files, you can use the eccc_merger.sh script:
+
+    $ ./eccc_merger.sh "${ECCC_CACHE}/almanac/3050519.xml" \
+      "${ECCC_CACHE}/almanac/3050520.xml" "${ECCC_CACHE}/almanac/3050521.xml" \
+      "${ECCC_CACHE}/almanac/3050522.xml" "${ECCC_CACHE}/almanac/3050526.xml" \
+      > banff.xml
+
+To get station IDs based on their geographical position, you can use the
+eccc_map tool. A public instance is hosted online at
+https://stations.wikimedia.ca/ .
+
+
 4. Transform data into target format
 
 Here we are, this is the fun part: let's create weather data in Wikimedia
 Commons format.
 
-    $ ./eccc_to_commons "${ECCC_CACHE}" "${COMMONS_CACHE}" 2>log
+    $ ./eccc_to_commons.sh "${ECCC_CACHE}" "${COMMONS_CACHE}" 2>log
 
 It will replicate the future Commons content paths inside nested directories.
 So, for example future
@@ -145,4 +160,11 @@ conversion.
 
 5. Upload to destination
 
-Not done yet.
+It's now time to share our work with the world, and that is the purpose of
+the mediawiki_post.sh script.
+
+    $ ./mediawiki_post.sh "${COMMONS_CACHE}"
+
+It takes the Commons cache as its only parameter: the cache's file hierarchy
+will be replicated on Commons. On first run, it will prompt for the
+credentials of the MediaWiki account used to perform the import.
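Since the upload step replicates the cache's file hierarchy on the wiki, it
can be useful to preview which relative paths would be created before running
mediawiki_post.sh. Below is a minimal, self-contained sketch of such a
preview; the demo directory and the `monthly/2606.json` file name are
invented for illustration, and mediawiki_post.sh itself remains the
authoritative tool:

```shell
#!/bin/sh
# Sketch: list every file's path relative to the cache root, i.e. the
# hierarchy an upload would replicate. A throwaway demo cache is built
# here so the snippet runs standalone; point "demo" at your real
# ${COMMONS_CACHE} instead to preview an actual upload.
demo=$(mktemp -d)
mkdir -p "${demo}/monthly"
printf '{}\n' > "${demo}/monthly/2606.json"   # illustrative sample file

# Print paths relative to the cache root (strip the leading "./").
( cd "${demo}" && find . -type f | sed 's|^\./||' | sort )

rm -rf "${demo}"
```

Running the same `find`/`sed` pipeline against the real cache gives a dry-run
listing of the pages the import would touch, without contacting the wiki.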