eccc_to_commons - Set of tools to replicate
Environment and Climate Change Canada data on
Wikimedia Commons

This is a collection of scripts (mainly Bash and XSD/XSLT).

Most of them use standard Unix/GNU tools, so they should work on any recent
GNU distribution. In addition to coreutils, the prerequisites are:

- Bash 4+
- Curl
- Xmlstarlet
- Jq
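A quick way to confirm the prerequisites are available before starting (a
minimal sketch; the binary names below are the usual ones on GNU
distributions):

```shell
# Report any missing prerequisite (binary names assumed: bash, curl,
# xmlstarlet, jq); prints nothing when everything is installed
for tool in bash curl xmlstarlet jq; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```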

This repository is sponsored by Environment and Climate Change Canada and
Wikimedia Canada.


Provided scripts, ordered by chronological usage:
dllist.sh                outputs a curl configuration file listing all
                         available ECCC data
eccc_fixer.sh            fixes upstream XML data files
eccc_fixer.xslt          fixes a single upstream XML data file
commons_rules.xsd        validates ECCC XML from a Wikimedian point of view
eccc_to_commons.sh       transforms ECCC XML files into JSON
monthly_to_commons.xslt  transforms an ECCC monthly XML file into JSON


Usage:
The general idea of a large scale data import process is:
1. Download a copy of all data required for the import
2. Fix the files
3. Validate the cache
4. Transform data into the target format
5. Upload to the destination

These tools require some technical knowledge, so they are not intended for
general use. They are however a good starting point to discover how large
imports are processed in community driven projects.


In practice:
Besides a reasonable amount of available disk space, you will have to create
two distinct folders: the first will contain a copy of the downloaded ECCC
data, while the second will contain the data to be uploaded to Wikimedia
Commons.
The following sections will refer to them as ${ECCC_CACHE} and
${COMMONS_CACHE}. These environment variables must be set or replaced by
valid paths when the commands are used.
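For example (the paths here are hypothetical; any location with enough free
space will do):

```shell
# Hypothetical cache locations; adjust both paths to your setup
export ECCC_CACHE="$HOME/eccc_cache"
export COMMONS_CACHE="$HOME/commons_cache"
mkdir -p "$ECCC_CACHE" "$COMMONS_CACHE"
```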


1. Download a copy of all data required for the import
1.1 Create a list of all ECCC provided files
First, we generate a list of all the historical data provided by ECCC.

$ ./dllist.sh "${ECCC_CACHE}" > "downloads_all"

Expect a long runtime: as of January 2020, it generates a list with almost
650,000 download links.


1.2 Filter unwanted files
This long list may contain more files than you actually need, so you may want
to reduce it to avoid downloading and storing useless content.
This step depends on your own needs, so not all cases will be covered here.
downloads_all is a regular text file, so you can edit it with the usual tools
like sed, grep or your preferred interactive editor.

Here are a few examples to inspire you:

Keep only monthly data:
$ grep -B1 -A1 --no-group-separator \
  -E '^output = ".*/monthly/[0-9]*.xml"$' downloads_all > downloads_monthly

Remove all entries before a given URL (to restart an interrupted download):
$ sed -n '/https:\/\/climate.weather.gc.ca\/climate_data\/bulk_data_e.html?format=xml&timeframe=3&stationID=2606/,$p' \
  downloads_all > download_continue
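Another possibility is filtering on a single station. The snippet below
demonstrates the idea on a tiny hand-made sample list; the two-line stanza
format is an assumption based on the examples above, and stationID 2606 is
only an example:

```shell
# Build a tiny sample list (a real downloads_all holds ~650,000 entries)
printf '%s\n' \
  'url = "https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=xml&timeframe=3&stationID=2606"' \
  'output = "climate_data/monthly/2606.xml"' \
  'url = "https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=xml&timeframe=3&stationID=9999"' \
  'output = "climate_data/monthly/9999.xml"' > downloads_sample
# Keep each matching url line plus the output line that follows it
grep -A1 --no-group-separator 'stationID=2606"' downloads_sample > downloads_station
```

The same grep, pointed at your real downloads_all, produces a station-only
download list.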


1.3 Download wanted files
Feed your own list of downloads to curl. You can add any parameters you need.

$ curl --fail-early --create-dirs -K downloads_all


2. Fix the files
Be aware that the files you've downloaded are buggy. Yes, all of them: they
are distributed as-is by ECCC. But wait, there is a simple fix.

The clean way to perform the fix is to use the following script:

$ ./eccc_fixer.sh "${ECCC_CACHE}" "${ECCC_CACHE}-fixed"


However, if you don't want to keep the original files, you can just do:

$ find "${ECCC_CACHE}" -type f -name '*.xml' -exec sed -i -e \
  's/xsd:schemaLocation/xsi:schemaLocation/;s/xmlns:xsd="http:\/\/www.w3.org\/TR\/xmlschema-1\/"/xmlns:xsi="http:\/\/www.w3.org\/2001\/XMLSchema-instance"/' {} +
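To see what the substitution actually does, here it is applied to a minimal
hand-made sample (the element and attribute layout is illustrative only; the
two broken pieces are the xsd:schemaLocation attribute and the schema
namespace):

```shell
# A minimal broken header like the ones shipped by ECCC (illustrative only)
printf '%s\n' '<climatedata xmlns:xsd="http://www.w3.org/TR/xmlschema-1/" xsd:schemaLocation="bulkschema.xsd"/>' > sample.xml
# Same substitution as above: fix the attribute prefix and the namespace
sed -i -e 's/xsd:schemaLocation/xsi:schemaLocation/;s/xmlns:xsd="http:\/\/www.w3.org\/TR\/xmlschema-1\/"/xmlns:xsi="http:\/\/www.w3.org\/2001\/XMLSchema-instance"/' sample.xml
cat sample.xml
```

After the rewrite the file declares the standard XMLSchema-instance namespace
and uses xsi:schemaLocation, which is what validators expect.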

From now on, this guide expects "${ECCC_CACHE}" to point to the directory
containing fixed files only.


3. Validate the cache
It's important to make the effort to validate the files before processing
them. Every transformation makes assumptions about data structure and content
that can only be asserted with proper validation schemas. Bypassing this step
may lead to broken transformations or invalid final data.
This step is split in two: first we check the data is valid from an ECCC
point of view, then we check it's valid through Wikimedian eyes.

3.1 Validate the data according to ECCC standards
The XML schema distributed by ECCC is itself incorrect: it won't validate any
XML coming from them. A fixed version can be found on the Wikimedia Canada
Git repositories.

$ git clone https://git.wikimedia.ca/eccc_schema.git
$ find "${ECCC_CACHE}" -type f -name '*.xml' -exec xmlstarlet val -b \
  -s eccc_schema/bulkschema.xsd {} \;

The second command will list all invalid files. If the output is empty, you
can continue.


3.2 Validate the data according to Wikimedia standards

$ find "${ECCC_CACHE}" -type f -name '*.xml' -exec xmlstarlet val -b \
  -s commons_rules.xsd {} \;

As previously, the output should be empty. Otherwise, you must resolve every
single problem before continuing.


4. Transform data into the target format
Here we are, the fun part: let's create weather data in Wikimedia Commons
format.

$ ./eccc_to_commons.sh "${ECCC_CACHE}" "${COMMONS_CACHE}" 2>log

It will replicate the future Commons content paths inside nested directories.
For example, the future
https://commons.wikimedia.org/wiki/Data:weather.gc.ca/Monthly/4271.tab
resource will be created in ${COMMONS_CACHE}/weather.gc.ca/Monthly/4271.tab.
A summary log file is created for further reference on what has been done
during the conversion.
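Since jq is already a prerequisite, a quick sanity check that every produced
file is well-formed JSON can look like this (sketched on a throwaway
directory with a hand-made sample file; point the find at "${COMMONS_CACHE}"
in real use):

```shell
# Create a stand-in for ${COMMONS_CACHE} with one sample .tab file
mkdir -p demo_commons/weather.gc.ca/Monthly
printf '{"license":"CC0-1.0","data":[]}\n' > demo_commons/weather.gc.ca/Monthly/4271.tab
# "jq empty" prints nothing and exits zero on valid JSON, and names the
# offending file on any malformed input
find demo_commons -type f -name '*.tab' -exec jq empty {} +
```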


5. Upload to the destination
Not done yet.