X-Git-Url: https://git.wikimedia.ca/?p=eccc_to_commons.git;a=blobdiff_plain;f=README;h=198387187540ad0681a28ba1da727ee6758e4aa6;hp=ca371dca55e57823b9fb61ca6608289c4649de8c;hb=HEAD;hpb=2f3682a6a85c816ba37855f0633478869334c529

diff --git a/README b/README
index ca371dc..1983871 100644
--- a/README
+++ b/README
@@ -12,8 +12,7 @@ distribution. In addition to coreutils, prerequisites are:
 
 - Xmlstarlet
 - Jq
 
-This repository is sponsored by Environment and Climate change Canada and
-Wikimedia Canada.
+This repository is sponsored by Wikimedia Canada.
 
@@ -22,8 +21,11 @@ dllist.sh                 outputs a curl configuration file listing all availabl
 eccc_fixer.sh             fix upstream data XML files
 eccc_fixer.xslt           fix upstream data XML file
 commons_rules.xsd         validate ECCC XML from a Wikimedian point of view
+eccc_merger.sh            merge multiple ECCC XML files
 eccc_to_commons.sh        transform ECCC XML files into JSON
 monthly_to_commons.xslt   transform ECCC monthly XML file into JSON
+almanac_to_commons.xslt   transform ECCC almanac XML file into JSON
+mediawiki_post.sh         upload a directory to a MediaWiki
 
 Usage:
 
@@ -71,8 +73,8 @@ Keep only monthly data:
   -E '^output = ".*/monthly/[A-Z0-9]{7}.xml"$' > downloads_monthly
 
 Remove all downloads before (restart interrupted download):
-    $ sed -n '/https:\/\/climate.weather.gc.ca\/climate_data\/bulk_data_e.html?format=xml&timeframe=3&stationID=2606/,$p' \
-        downloads_all > download_continue
+    $ sed -n '/https:\/\/climate.weather.gc.ca\/climate_data\/bulk_data_e.html?format=xml&timeframe=3&stationID=2606/,$p' \
+        downloads_all > download_continue
 
 1.3 Download wanted files
 
@@ -129,11 +131,25 @@ Same as previously, the output should be empty. Otherwise, you must
 resolve every single problem before continuing.
 
+[OPTIONAL STEP] Merge multiple XML files
+
+Sometimes, per-station granularity is more detail than necessary.
+If you need to merge two or more XML files, you can use the eccc_merger.sh
+script:
+
+    $ ./eccc_merger.sh "${ECCC_CACHE}/almanac/3050519.xml" \
+        "${ECCC_CACHE}/almanac/3050520.xml" "${ECCC_CACHE}/almanac/3050521.xml" \
+        "${ECCC_CACHE}/almanac/3050522.xml" "${ECCC_CACHE}/almanac/3050526.xml" \
+        > banff.xml
+
+To get station IDs based on their geographical position, you can use the
+eccc_map tool. A public instance is hosted online at
+https://stations.wikimedia.ca/ .
+
+
 4. Transform data into target format
 
 Here we are, here is the fun part: let's create weather data in Wikimedia
 Commons format.
 
-    $ ./eccc_to_commons "${ECCC_CACHE}" "${COMMONS_CACHE}" 2>log
+    $ ./eccc_to_commons.sh "${ECCC_CACHE}" "${COMMONS_CACHE}" 2>log
 
 It will replicate the future Commons content paths inside nested directories.
 So, for example future
@@ -144,4 +160,11 @@ conversion.
 
 5. Upload to destination
 
-Not done yet.
+It's now time to share our work with the world, and that is the purpose of
+the mediawiki_post.sh script.
+
+    $ ./mediawiki_post.sh "${COMMONS_CACHE}"
+
+It takes the Commons cache as its parameter: the cache's file hierarchy will
+be replicated on Commons. On first run, it asks for the credentials of the
+MediaWiki account to use for the import.
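
The download-list steps touched by the third hunk (keep only monthly data, resume an interrupted download) can be exercised end to end on a synthetic curl configuration. This is a minimal sketch, not the repository's exact commands: the station IDs and output file names below are made up, and the use of `grep -B1` to carry along each entry's url line is an assumption about how the monthly filter is completed.

```shell
# Build a tiny synthetic curl config: each download is a url/output pair.
# Station IDs and output names here are made up for illustration.
cat > downloads_all <<'EOF'
url = "https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=xml&timeframe=2&stationID=1000"
output = "cache/almanac/A000001.xml"
url = "https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=xml&timeframe=3&stationID=2000"
output = "cache/monthly/B000002.xml"
EOF

# Keep only monthly entries: match the output line and use -B1 to keep the
# url line that precedes it; drop grep's "--" group separators.
grep -B1 -E '^output = ".*/monthly/[A-Z0-9]{7}.xml"$' downloads_all \
  | grep -v '^--$' > downloads_monthly

# Resume an interrupted run: print everything from a given station onward
# (same sed address-range trick as in the README, on the synthetic data).
sed -n '/stationID=2000/,$p' downloads_all > download_continue
```

The real downloads_all is produced by dllist.sh and is far larger, but the same pipeline applies unchanged.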