The RapidAI4EO corpus is a large dataset of analysis ready satellite time series imagery originally designed for land use and land cover (LULC) classification and change detection, but applicable to a wide range of research topics in remote sensing and machine learning. The corpus is sampled at 500,000 locations across Europe, with a time series of coincident Planet Fusion and Sentinel-2 imagery and multiclass LULC labels derived from the 2018 CORINE Land Cover (CLC) inventory, available at each location. For more details, refer to the corpus documentation.
This tutorial covers some basic aspects of accessing the corpus. Altogether, it contains 79 million image chips in addition to metadata, labels, and other resources. This comes to some 17TB of data. As such, the main goal of the tutorial is to provide patterns for efficiently querying and downloading subsets of the data relevant to researchers who do not require the corpus in its entirety.
Three types of queries, which can also be combined, will be covered here:

- spatial queries, filtering by a geometry of interest;
- temporal queries, filtering by timestep; and
- label queries, filtering by CLC class.
In Section 3 Querying with STAC we demonstrate the first two types of queries using the SpatioTemporal Asset Catalog (STAC) standard. As described in further detail there, the corpus has a lot of redundant metadata that can make some STAC queries rather slow. In Section 4 Optimized queries we will look at querying of the corpus by using files that store the spatial metadata and labels in a centralized and non-redundant format. In both these sections we are identifying the files that we want to download by their URLs. In Section 5 Downloading subsets with azcopy we look at one pattern for downloading from such a list of selected files.
Before we begin, we can install the dependencies of the notebook by uncommenting and running the following command:
# !pip install geopandas pandas pystac shapely
Caution: As detailed in the introduction, the corpus is voluminous. If you are only interested in subsets of the data, jump to the following sections.
For large downloads, it is recommended to use AzCopy. The entire corpus can be transferred using the following command:
azcopy copy --recursive 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/' $DESTINATION
The corpus is released with an accompanying static SpatioTemporal Asset Catalog (STAC).
STAC is a flexible and provider-agnostic standard for indexing Earth observation imagery.
It comes with a rich ecosystem of tools for accessing data.
In this tutorial we will query data with pystac; however, additional tools are also available, including tools for languages other than Python and software plugins.
Users can browse and familiarize themselves with the RapidAI4EO STAC using Radiant's STAC Browser user interface.
A more detailed description of the RapidAI4EO STAC structure can be found in the "Structure" section of the corpus documentation.
A basic level of familiarity with the STAC specification is recommended before proceeding with this section.
Before we jump in, it bears mentioning that our corpus and therefore its STAC contain a lot of redundant metadata. At each of the 500,000 locations in the corpus, we have coincident imagery products (Planet Fusion and Sentinel-2) as well as CLC multiclass labels. We also know that the time series of like products have the same timesteps at all locations. If we aren't careful about how we query the STAC, this will lead to a lot of redundant spatio-temporal comparisons. In the section Optimized queries we use some of this knowledge to find the Assets we're looking for more efficiently.
The additional latency induced by retrieving STAC files via HTTPS could be reduced by first downloading the STAC and querying a local copy; however, be aware that the STAC itself is also quite voluminous (approximately 38GB).
import pandas as pd
from pystac import Catalog, Item
import shapely.geometry

catalog_uri = (
    "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/stac-v1.0/catalog.json"
)
catalog = Catalog.from_file(catalog_uri)

for collection in catalog.get_collections():
    print(collection.id)
rapidai4eo_v1_source_pf rapidai4eo_v1_source_s2 rapidai4eo_v1_labels
When we list the Collections in our Catalog, we see that there is one for each type of product: Planet Fusion imagery (shorthand: pf), Sentinel-2 imagery (s2), and the CLC multiclass labels (labels).
Now, let's write a function that crawls a STAC Collection and returns all Items intersecting a given geometry. We will search a single Collection, rather than all Collections in the Catalog, because we know that we have matching products at each location. Running the same spatial query over multiple Collections would be redundant, whereas we can use the links in the STAC Items to find the matching Items for other products at the same location. Note that STAC always uses EPSG:4326 geometries.
def items_intersecting_geometry(collection, geometry):
    """Recursively find all STAC items intersecting a geometry.

    Our STAC structure (further detailed in the corpus documentation) has a hierarchy of
    Collections to speed up spatial queries. Recursively search these Collections, and
    return all STAC Items that intersect our geometry.
    """
    intersecting_items = []
    collection_bboxes = [
        shapely.geometry.box(*bounds) for bounds in collection.extent.spatial.bboxes
    ]
    if any(bbox.intersects(geometry) for bbox in collection_bboxes):
        # Collect all matching items in this collection
        for item in collection.get_items():
            item_geometry = shapely.geometry.shape(item.geometry)
            if item_geometry.intersects(geometry):
                intersecting_items.append(item)
        # Recursively search our nested collections for items
        for subcollection in collection.get_collections():
            intersecting_items += items_intersecting_geometry(subcollection, geometry)
    return intersecting_items
For our example query, we will use a rough bounding box around Berlin as our search geometry and we will query the Planet Fusion STAC Collection.
berlin_bbox = shapely.geometry.box(13.05, 52.35, 13.72, 52.69)
pf_collection = catalog.get_child("rapidai4eo_v1_source_pf")
%%time
berlin_pf_items = items_intersecting_geometry(pf_collection, berlin_bbox)
print(f"Found {len(berlin_pf_items)} in the vicinity of Berlin.")
Found 198 in the vicinity of Berlin. CPU times: user 30.3 s, sys: 2.08 s, total: 32.4 s Wall time: 2min 58s
The search time is approaching three minutes, despite querying against a small bounding box. Crawling over the STAC like this is not recommended for large geometries.
Our spatial query has returned 198 Planet Fusion Items. Within our STAC structure, an Item represents all timesteps for one product at one location. Within the Item, the actual files are referenced as Assets.
Let's take a look at a few Assets to get a feel for them.
example_pf_item = berlin_pf_items[0]
example_pf_assets = example_pf_item.get_assets()

n_head = 5
for i, (asset_name, asset) in enumerate(example_pf_assets.items()):
    print(f"Asset: {asset_name}:")
    print(asset.to_dict())
    print("---")
    if i >= n_head - 1:
        break
Asset: sr_2018-01-03: {'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-01-03.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-01-03T10:30:00Z', 'roles': ['data']} --- Asset: qa_2018-01-03: {'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-QA/2018-01-03.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-01-03T10:30:00Z', 'roles': ['metadata']} --- Asset: sr_2018-01-08: {'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-01-08.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-01-08T10:30:00Z', 'roles': ['data']} --- Asset: qa_2018-01-08: {'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-QA/2018-01-08.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-01-08T10:30:00Z', 'roles': ['metadata']} --- Asset: sr_2018-01-13: {'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-01-13.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-01-13T10:30:00Z', 'roles': ['data']} ---
Assets are identified by a key that is unique per Item.
Identifying the timestep to which an Asset belongs can be achieved by looking at the Asset's datetime property; with pystac, this is accessed as asset.extra_fields['datetime'].
An Asset also has roles, in the case of this corpus either "data" or "metadata". "Data" Assets reference the image chips themselves, whereas Planet Fusion also has pixelwise quality assurance masks linked as "metadata" assets.1 Note that the Asset keys were also designed to provide information on the type of file ("sr" for surface reflectance image chip or "qa" for quality assurance masks) and timestep. This allows us to quickly select the Assets we are interested in if we know these criteria.
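As a quick illustration of selecting Assets by key alone, here is a sketch using hypothetical keys shaped like those shown above (not queried from the live STAC):

```python
# Asset keys encode the file type ("sr" or "qa") and the timestep, so we can
# select Assets without inspecting their datetime or role metadata at all.
example_keys = ["sr_2018-01-03", "qa_2018-01-03", "sr_2018-01-08", "qa_2018-01-08"]

# Keep only surface reflectance chips via a simple key-prefix test.
sr_keys = [k for k in example_keys if k.startswith("sr_")]
print(sr_keys)  # ['sr_2018-01-03', 'sr_2018-01-08']
```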
Each Asset also provides the absolute URL to download the file via the href property.
Let's look at some toy examples for filtering Assets.
1 Sentinel-2 metadata in the form of traceability manifests are stored at the tile Collection level. For more details, refer to the corpus documentation.
# Define the time period for filtering
start_date = pd.to_datetime("2018-05-01T00:00:00Z")
end_date = pd.to_datetime("2018-06-01T00:00:00Z")

# Filter the assets based on the defined time period
time_filtered_assets = {
    k: v
    for k, v in example_pf_assets.items()
    if start_date < pd.to_datetime(v.extra_fields["datetime"]) < end_date
}

# Print the number of assets and their names within the time period
print(f"{len(time_filtered_assets)} Assets in time period {start_date} - {end_date}:")
for asset_name, _ in time_filtered_assets.items():
    print(f"  {asset_name}.")

# Define the role for further filtering
role = "data"

# Filter the time-filtered assets based on the defined role
role_filtered_assets = {
    k: v for k, v in time_filtered_assets.items() if v.has_role(role)
}
print(f'Of which, {len(role_filtered_assets)} have the role "data".')
for asset_name, _ in role_filtered_assets.items():
    print(f"  {asset_name}.")
12 Assets in time period 2018-05-01 00:00:00+00:00 - 2018-06-01 00:00:00+00:00: sr_2018-05-03. qa_2018-05-03. sr_2018-05-08. qa_2018-05-08. sr_2018-05-13. qa_2018-05-13. sr_2018-05-18. qa_2018-05-18. sr_2018-05-23. qa_2018-05-23. sr_2018-05-28. qa_2018-05-28. Of which, 6 have the role "data". sr_2018-05-03. sr_2018-05-08. sr_2018-05-13. sr_2018-05-18. sr_2018-05-23. sr_2018-05-28.
And we can also list the URLs of the files for download, which we could then download using azcopy (see Downloading subsets with azcopy), the Azure Storage SDK, or via HTTPS calls.
print("URLs of filtered Assets:")
for asset_name, asset in role_filtered_assets.items():
    print(asset.href)
URLs of filtered Assets: https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-03.tif https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-08.tif https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-13.tif https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-18.tif https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-23.tif https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-28.tif
So far our examples have just iterated over a single example Item. We could iterate over the rest of our Items, running the same or other queries, to generate a list of all relevant Assets in our original spatial query. If we want to reuse the same temporal and role filtering, we can also rely on the knowledge that each location has the same timesteps and retrieve the corresponding Assets by reusing the keys our first temporal and role query had returned.
query_pf_hrefs = []
pf_asset_query_keys = [*role_filtered_assets.keys()]
for item in berlin_pf_items:
    for k in pf_asset_query_keys:
        query_pf_hrefs.append(item.assets[k].href)

print(f"Found {len(query_pf_hrefs):,} matching URLs.")
Found 1,188 matching URLs.
We have been filtering within the Planet Fusion Collection only, but we can also tie in the corresponding Sentinel-2 images and CLC labels by using the links within the Planet Fusion Items, specifically the links with relation type "related".
Let's take a first look at our single example_pf_item again to see how these relations work.
related_items = [
    link.resolve_stac_object().target
    for link in example_pf_item.links
    if link.rel == "related"
]
for item in related_items:
    print(f"Found related item in Collection {item.collection_id}.")
Found related item in Collection rapidai4eo_v1_labels. Found related item in Collection rapidai4eo_v1_source_s2.
When looking at the related Items, there are a few ways of getting information about what each Item represents (such as item.properties['title']), but the most convenient property to filter by, in the case that we only want one or the other product type, is item.collection_id. As we know that there is a unique Collection per product type, we can filter the related Items down to those that belong to a specific Collection.
We've looked at Planet Fusion Assets, now let's also inspect Sentinel-2 and CLC labels examples.
for item in related_items:
    print(f"Found related item in Collection {item.collection_id}.")
    for i, (asset_name, asset) in enumerate(item.assets.items()):
        print(f"Asset: {asset_name}:")
        print(asset.to_dict())
        print("---")
        if i >= n_head - 1:
            break
    print("---")
Found related item in Collection rapidai4eo_v1_labels. Asset: labels: {'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/labels/33N/16E-243N/labels_33N_16E-243N_29_01.geojson', 'type': 'application/geo+json', 'roles': ['labels', 'labels-vector']} --- --- Found related item in Collection rapidai4eo_v1_source_s2. Asset: s2_2018-01: {'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/S2-SR/2018-01.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-01-15T10:30:00Z', 'roles': ['data']} --- Asset: s2_2018-02: {'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/S2-SR/2018-02.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-02-15T10:30:00Z', 'roles': ['data']} --- Asset: s2_2018-03: {'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/S2-SR/2018-03.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-03-15T10:30:00Z', 'roles': ['data']} --- Asset: s2_2018-04: {'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/S2-SR/2018-04.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-04-15T10:30:00Z', 'roles': ['data']} --- Asset: s2_2018-05: {'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/S2-SR/2018-05.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-05-15T10:30:00Z', 'roles': ['data']} --- ---
We see here that the corresponding labels Item has a single Asset pointing to a GeoJSON file, which contains the CLC multiclass labels. The Sentinel-2 Item has one Asset for each of the monthly image mosaics. Tying together some of the steps above, let's look at filtering all Planet Fusion and Sentinel-2 Assets.
query_pfs2_hrefs = []
pf_asset_query_keys = [*role_filtered_assets.keys()]
s2_asset_query_keys = None
for pf_item in berlin_pf_items:
    # Filter Planet Fusion Assets
    for k in pf_asset_query_keys:
        query_pfs2_hrefs.append(pf_item.assets[k].href)
    # Filter Sentinel-2 Assets related to that Planet Fusion Item
    # (i.e. sampled at the same location)
    related_items = [
        link.resolve_stac_object().target
        for link in pf_item.links
        if link.rel == "related"
    ]
    for related_item in related_items:
        # Filter down to s2 items only (exclude labels for this example)
        if related_item.collection_id == "rapidai4eo_v1_source_s2":
            if s2_asset_query_keys is None:
                # This is our first look at s2 Assets, so the keys we want to
                # access have not yet been defined
                s2_asset_query_keys = [
                    k
                    for k, v in related_item.assets.items()
                    if start_date
                    < pd.to_datetime(v.extra_fields["datetime"])
                    < end_date
                ]
            # NOTE "role" filtering to only "data" is not necessary in the case of
            # Sentinel-2 as no metadata files are linked to the STAC Items.
            for k in s2_asset_query_keys:
                query_pfs2_hrefs.append(related_item.assets[k].href)

print(f"Found {len(query_pfs2_hrefs):,} matching URLs.")
Found 1,386 matching URLs.
We have seen some approaches to querying the STAC and applied some of our knowledge about redundant metadata in the corpus to avoid redundantly accessing remote STAC files. The patterns applied here can be adapted to filter the data by any spatio-temporal criteria. Crawling a static STAC can be slow, and our workarounds to avoid redundant filtering are not always straightforward. In the next section we will look at using some resource files to run similar queries in a faster and more straightforward manner.
As mentioned above, we can leverage our knowledge about the redundant nature of metadata in the corpus to improve query times, albeit by working somewhat outside the STAC standard. To do this we will use three files that give us an overview of the metadata in the corpus:

- rapidai4eo_geometries.geojson.gz, containing the geometry of every sample location;
- rapidai4eo_labels.csv.gz, containing the CLC multiclass labels for every sample location; and
- rapidai4eo_label_mappings.csv, mapping CLC class names to the columns of the labels file.
All three of these files should be downloaded for the below examples to run (they can be retrieved via the cURL commands below). If you do not download them to the same directory as this notebook, or if the files are renamed, update the variables in the cell below.
import gzip
import geopandas as gpd
import pandas as pd
import shapely.geometry
geometries_file = "rapidai4eo_geometries.geojson.gz"
labels_file = "rapidai4eo_labels.csv.gz"
labels_mapping_file = "rapidai4eo_label_mappings.csv"
geometries_file_url = "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/resources/rapidai4eo_geometries.geojson.gz"
labels_file_url = "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/resources/rapidai4eo_labels.csv.gz"
labels_mapping_file_url = "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/resources/rapidai4eo_label_mappings.csv"
Optionally, download the resource files via cURL.
!curl {geometries_file_url} -o {geometries_file}
!curl {labels_file_url} -o {labels_file}
!curl {labels_mapping_file_url} -o {labels_mapping_file}
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 15.0M 100 15.0M 0 0 20.3M 0 --:--:-- --:--:-- --:--:-- 20.3M % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 8520k 100 8520k 0 0 12.2M 0 --:--:-- --:--:-- --:--:-- 12.2M % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 1759 100 1759 0 0 13126 0 --:--:-- --:--:-- --:--:-- 13029
%%time
with gzip.open(geometries_file) as fs:
    geometries = gpd.read_file(fs).set_index("sample_id")
labels = pd.read_csv(labels_file).set_index("sample_id")
labels_mapping = pd.read_csv(labels_mapping_file).set_index("index")
print("Geometries:")
print(geometries.head())
print("---")
print("Labels (columns truncated):")
print(labels[labels.columns[:6]].head())
print("---")
Geometries: geometry sample_id 35N_19E-292N_23_31 POLYGON ((26.49348 63.29178, 26.49338 63.29717... 35N_19E-292N_01_05 POLYGON ((26.17905 63.40880, 26.17889 63.41418... 35N_19E-292N_17_03 POLYGON ((26.15755 63.32251, 26.15739 63.32789... 35N_19E-292N_05_11 POLYGON ((26.25168 63.38765, 26.25154 63.39304... 35N_19E-292N_36_28 POLYGON ((26.45890 63.22165, 26.45880 63.22704... --- Labels (columns truncated): clc_111 clc_112 clc_121 clc_122 clc_123 clc_124 sample_id 31N_20E-211N_16_38 0.0 0.000000 0.0 0.0 0.0 0.0 35N_24E-173N_23_37 0.0 0.269575 0.0 0.0 0.0 0.0 32N_20E-205N_28_20 0.0 0.254225 0.0 0.0 0.0 0.0 33N_24E-237N_23_27 0.0 0.000000 0.0 0.0 0.0 0.0 33N_27E-224N_33_23 0.0 0.000000 0.0 0.0 0.0 0.0 --- CPU times: user 29.5 s, sys: 421 ms, total: 30 s Wall time: 30 s
Each sample location within the corpus has a unique ID, which is integral to building the IDs of Items within the STAC and the URLs of the Assets. For example, the sample location with ID 35N_19E-292N_23_31 corresponds to a Planet Fusion Item (rapidai4eo_v1_source_pf_35N_19E-292N_23_31), a Sentinel-2 Item (rapidai4eo_v1_source_s2_35N_19E-292N_23_31), and a labels Item (rapidai4eo_v1_labels_35N_19E-292N_23_31).
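This relationship can be expressed directly in code. A minimal sketch, assuming only the Collection IDs listed earlier and the sample ID above:

```python
sample_id = "35N_19E-292N_23_31"

# Each product's Item ID is simply its Collection ID prefixed onto the sample ID.
collections = (
    "rapidai4eo_v1_source_pf",
    "rapidai4eo_v1_source_s2",
    "rapidai4eo_v1_labels",
)
item_ids = {c: f"{c}_{sample_id}" for c in collections}
print(item_ids["rapidai4eo_v1_source_pf"])
# rapidai4eo_v1_source_pf_35N_19E-292N_23_31
```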
Similarly, given a sample ID, a product type, and—in the case of our image products—a valid date we can build the URLs to any Asset in the corpus, as described in the Appendix.
We will rely on these files to filter by geometries or CLC class and apply our knowledge of the timesteps available in the corpus to quickly query Assets.
Let us begin by running the same query over Berlin we had run above.
%%time
berlin_bbox = shapely.geometry.box(13.05, 52.35, 13.72, 52.69)
spatially_filtered_ids = geometries[geometries.geometry.intersects(berlin_bbox)].index
print(f"{spatially_filtered_ids.size} IDs found in the vicinity of Berlin.")
print("---")
198 IDs found in the vicinity of Berlin. --- CPU times: user 123 ms, sys: 12 ms, total: 135 ms Wall time: 134 ms
As we can see, the spatial query returned in a fraction of the time it took to crawl the STAC Collection. When querying against larger geometries, say an entire country, this can be the difference between hours and minutes.
What we've identified so far are the IDs of locations with Assets we will want to download. The next step is to build the URLs to those Assets so that we can download them. To achieve this, we'll lean on the get_asset_hrefs function in the rapidai4eo.py file provided with this tutorial (download that file to the same directory as this notebook if you have not already). This function takes the busywork out of transforming a list of IDs into a list of hrefs. If we just pass it a list of IDs, it will return URLs for all products (Planet Fusion imagery, Sentinel-2 imagery, and Planet Fusion quality assurance masks) and all available timesteps at those locations. We can optionally pass a list of products, or a start and end date as a tuple, to filter by those criteria.
from rapidai4eo import get_asset_hrefs, PRODUCTS
print(PRODUCTS)
['pfsr', 'pfqa', 's2']
We see here that we have the following product keys: 'pfsr' for Planet Fusion imagery, 'pfqa' for Planet Fusion quality assurance masks, and 's2' for Sentinel-2 imagery.
hrefs = get_asset_hrefs(spatially_filtered_ids)
print(
    f"{len(hrefs):,} image chips and quality assurance masks in our area of interest."
)

hrefs = get_asset_hrefs(spatially_filtered_ids, products=["pfsr", "s2"])
print(
    f"{len(hrefs):,} image chips in our area of interest, ignoring quality assurance masks"
)

hrefs = get_asset_hrefs(spatially_filtered_ids, products="s2")
print(f"{len(hrefs):,} Sentinel-2 image chips only in our area of interest.")

start_date = pd.to_datetime("2018-01-01T00:00:00Z")
end_date = pd.to_datetime("2019-01-01T00:00:00Z")
hrefs = get_asset_hrefs(
    spatially_filtered_ids,
    products=["pfsr", "s2"],
    temporal_filter=(start_date, end_date),
)
print(
    f"{len(hrefs):,} Planet Fusion and Sentinel-2 image chips in our area of interest and "
    "filtering by the time period for which both products are available (2018)."
)
60,192 image chips and quality assurance masks in our area of interest. 31,284 image chips in our area of interest, ignoring quality assurance masks 2,376 Sentinel-2 image chips only in our area of interest. 16,830 Planet Fusion and Sentinel-2 image chips in our area of interest and filtering by the time period for which both products are available (2018).
In the above examples we applied a spatial query to identify the relevant sample IDs, but if we are interested in certain CLC classes we could also identify sample IDs from the labels file.
First let's take a look at our labels mapping DataFrame, which allows us to map from the names of the classes ("class_name") to the corresponding column in our labels DataFrame (via "index").
labels_mapping
| index | clc_level1_id | clc_level2_id | clc_level3_id | class_name |
|---|---|---|---|---|
| clc_111 | 1 | 11 | 111 | Continuous urban fabric |
| clc_112 | 1 | 11 | 112 | Discontinuous urban fabric |
| clc_121 | 1 | 12 | 121 | Industrial or commercial units |
| clc_122 | 1 | 12 | 122 | Road and rail networks and associated land |
| clc_123 | 1 | 12 | 123 | Port areas |
| clc_124 | 1 | 12 | 124 | Airports |
| clc_131 | 1 | 13 | 131 | Mineral extraction sites |
| clc_132 | 1 | 13 | 132 | Dump sites |
| clc_133 | 1 | 13 | 133 | Construction sites |
| clc_141 | 1 | 14 | 141 | Green urban areas |
| clc_142 | 1 | 14 | 142 | Sport and leisure facilities |
| clc_211 | 2 | 21 | 211 | Non-irrigated arable land |
| clc_212 | 2 | 21 | 212 | Permanently irrigated land |
| clc_213 | 2 | 21 | 213 | Rice fields |
| clc_221 | 2 | 22 | 221 | Vineyards |
| clc_222 | 2 | 22 | 222 | Fruit trees and berry plantations |
| clc_223 | 2 | 22 | 223 | Olive groves |
| clc_231 | 2 | 23 | 231 | Pastures |
| clc_241 | 2 | 24 | 241 | Annual crops associated with permanent crops |
| clc_242 | 2 | 24 | 242 | Complex cultivation patterns |
| clc_243 | 2 | 24 | 243 | Land principally occupied by agriculture with ... |
| clc_244 | 2 | 24 | 244 | Agro-forestry areas |
| clc_311 | 3 | 31 | 311 | Broad-leaved forest |
| clc_312 | 3 | 31 | 312 | Coniferous forest |
| clc_313 | 3 | 31 | 313 | Mixed forest |
| clc_321 | 3 | 32 | 321 | Natural grasslands |
| clc_322 | 3 | 32 | 322 | Moors and heathland |
| clc_323 | 3 | 32 | 323 | Sclerophyllous vegetation |
| clc_324 | 3 | 32 | 324 | Transitional woodland-shrub |
| clc_331 | 3 | 33 | 331 | Beaches dunes sands |
| clc_332 | 3 | 33 | 332 | Bare rocks |
| clc_333 | 3 | 33 | 333 | Sparsely vegetated areas |
| clc_334 | 3 | 33 | 334 | Burnt areas |
| clc_335 | 3 | 33 | 335 | Glaciers and perpetual snow |
| clc_411 | 4 | 41 | 411 | Inland marshes |
| clc_412 | 4 | 41 | 412 | Peat bogs |
| clc_421 | 4 | 42 | 421 | Salt marshes |
| clc_422 | 4 | 42 | 422 | Salines |
| clc_423 | 4 | 42 | 423 | Intertidal flats |
| clc_511 | 5 | 51 | 511 | Water courses |
| clc_512 | 5 | 51 | 512 | Water bodies |
| clc_521 | 5 | 52 | 521 | Coastal lagoons |
| clc_522 | 5 | 52 | 522 | Estuaries |
| clc_523 | 5 | 52 | 523 | Sea and ocean |
| clc_999 | 9 | 99 | 999 | NODATA |
Let's say we are interested only in peat bogs for some research topic.
Looking at the mappings, we can see that the portion of a sample location covered by peat bog according to the 2018 CLC will be represented in our labels DataFrame by the column "clc_412". If we want to find all sample locations that are at least 80% covered by peat bog, we filter that column by values greater than 0.8.
peat_bog_samples = labels[labels["clc_412"] > 0.8].index
print(f"{peat_bog_samples.size:,}")
2,753
Now we can select the image chip URLs that we want to download in the same pattern we had used above, for example:
hrefs = get_asset_hrefs(peat_bog_samples, products="pfsr")
print(f"{len(hrefs):,} Planet Fusion image chips over peat bogs.")
401,938 Planet Fusion image chips over peat bogs.
Let's tie this together with a final example. For this case we will search for all image chips within Germany that are entirely covered by forest, also applying a temporal filter to restrict our results to the time period for which we have both image types (2018). We will read the geometry for Germany from Eurostat's NUTS dataset. In the case of forest classes, there are three level-3 entries in the CLC ontology: broad-leaved, coniferous, and mixed. For the labels query, we will combine these three classes by summation.
eu_countries_file = (
"https://gisco-services.ec.europa.eu/distribution/v2/nuts/topojson/"
"NUTS_RG_20M_2021_4326_LEVL_0.json"
)
eu_countries = gpd.read_file(eu_countries_file).set_index("id")
germany_geometry = eu_countries.loc["DE", "geometry"]
%%time
# Filter by our country geometry
germany_samples = geometries[geometries.geometry.within(germany_geometry)].index

# At level-3, the CLC has three forest classes, which all belong
# to CLC level-2 class 31.
forest_classes = labels_mapping[labels_mapping.clc_level2_id == 31].index.values

# Filter all available labels by our above spatial filter, keeping
# only relevant (forest class) columns
germany_labels = labels.loc[germany_samples][forest_classes]

# And now we can get the proportion of each location covered by any forest
# class simply by summing across columns.
germany_forest_label = germany_labels.sum(axis=1)

# Let's say we now want only those locations entirely covered by forest
germany_forest_samples = germany_forest_label[germany_forest_label == 1.0].index
print(
    f"Found {germany_forest_samples.size:,} samples in Germany entirely covered by forest (CLC class 31)."
)

# And finally we can select the URLs to download any product types we want.
# Let's say again we want both image types, without the QA files, for the
# calendar year 2018.
start_date = pd.to_datetime("2018-01-01T00:00:00Z")
end_date = pd.to_datetime("2019-01-01T00:00:00Z")
hrefs = get_asset_hrefs(
    germany_forest_samples,
    products=["pfsr", "s2"],
    temporal_filter=(start_date, end_date),
)
print(
    f"{len(hrefs):,} Planet Fusion and Sentinel-2 image chips in our area of interest and "
    "filtering by the time period for which both products are available (2018)."
)
Found 3,214 samples in Germany entirely covered by forest (CLC class 31). 273,190 Planet Fusion and Sentinel-2 image chips in our area of interest and filtering by the time period for which both products are available (2018). CPU times: user 4.55 s, sys: 15 ms, total: 4.56 s Wall time: 4.56 s
For this example query we now have a list of 273,190 image Assets, representing the time series for both Sentinel-2 and Planet Fusion at 3,214 locations across Germany covered entirely by forest. The same query run against the static STAC would have been prohibitively slow, but here it runs in about five seconds.
As mentioned above, now that we have a list of URLs we can download the matching files at will. Two approaches not covered in this tutorial would be to script the downloads leveraging the Azure Storage SDK or python-requests.
In the next section we look at using AzCopy to download files from a list.
If we save a list of the file URLs we want to download, as identified in the patterns from the above examples, we can pass those to an azcopy command:
azcopy copy 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/' my/destination/directory/ --list-of-files="files_of_interest.txt" --include-directory-stub
where

- --list-of-files is where we pass the text file in which we've saved the file URLs we want to download; and
- the --include-directory-stub flag creates subdirectories in our download destination to match those in our source, which prevents file name collisions.

Be aware that the AzCopy wiki makes the following statement about the --list-of-files flag:
As of version 10.3, this flag remains undocumented (except for this wiki page) and it does not appear in the in-app command line help. It is however regularly used by Azure Storage Explorer to pass file lists to its embedded copy of AzCopy.
(The reason that it's undocumented is that, at high file counts, it has slow performance for downloads and service-to-service copies. We want to fix that before we document it more officially).
(retrieved 2023-05-10)
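Producing the file list itself is straightforward: write the selected hrefs one per line. A minimal sketch, where the two hardcoded URLs stand in for any of the href lists built in the earlier sections:

```python
# Hypothetical stand-in for a list of hrefs selected in the earlier examples.
hrefs = [
    "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-03.tif",
    "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-08.tif",
]

# One entry per line, ready to hand to azcopy's --list-of-files flag.
with open("files_of_interest.txt", "w") as f:
    f.write("\n".join(hrefs))
```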
We have seen some patterns to browse, filter, and download RapidAI4EO corpus data.
Browsing via the STAC is supported, but it may run slowly for larger queries because the corpus has a lot of redundant metadata. Basic spatio-temporal and labels-based queries can be optimized by using a few files containing consolidated metadata. Finally, we saw how to leverage AzCopy to download a list of file URLs, enabling us to retrieve only the portions of the corpus data relevant to the application at hand.
The various patterns shown here can be built upon to filter by complex queries.
So far, the utilities in rapidai4eo.py have abstracted away how the URLs to Assets are built. These URLs follow a pattern that can be easily applied to build bespoke queries not supported by the utilities. More details are given in the Appendix below.
We saw in the section on Optimized queries that the URLs to data and metadata files can be built from only a few pieces of information; however, the URL building process itself was abstracted by our get_asset_hrefs function.
In this appendix we will cover the patterns present in the URLs.
We will focus on three file types: Planet Fusion and Sentinel-2 imagery, and Planet Fusion quality assurance files.
There are also labels files per sample location that are compliant with the STAC label extension, but these will not be covered here; we prefer instead to access label information via the consolidated labels file rapidai4eo_labels.csv.gz.
| Example | Prefix | UTM Zone | Tile offset | Sample ID | Product code | Date-derived basename |
|---|---|---|---|---|---|---|
| Sentinel-2 image | https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/ | 30N/ | 12E-191N/ | 30N_12E-191N_01_19/ | S2-SR/ | 2018-01.tif |
| Planet Fusion image | https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/ | 30N/ | 12E-191N/ | 30N_12E-191N_01_19/ | PF-SR/ | 2018-01-03.tif |
| Planet Fusion QA | https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/ | 30N/ | 12E-191N/ | 30N_12E-191N_01_19/ | PF-QA/ | 2018-01-03.tif |
The paths to all of these file types begin with https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery.2
The next three components relate to the sample location: the UTM zone in which the sample is found, the grid tile3 in which the sample is found, and the sample ID.
Note that the sample ID itself contains the UTM zone and tile offset, and therefore is sufficient for building URLs to this level.
After the sample ID, we have a code that specifies the product type and finally the file basename, which is derived from the timestep the asset represents.
Planet Fusion, which is available in the corpus with a five-day cadence, has basenames represented as ISO-formatted date strings.
Sentinel-2, which is available in the corpus at a monthly cadence, has the date representations truncated to the month (i.e. with formatting '%Y-%m').
Therefore the URL to any asset in the corpus can be built given a sample ID, a product type, and a timestep.
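Putting the pattern together, here is a minimal sketch of a URL builder following the table above. The function name build_asset_href is our own; get_asset_hrefs in rapidai4eo.py implements the same logic more completely.

```python
PREFIX = "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery"

# Product keys as used in rapidai4eo.py, mapped to the product codes in the URLs.
PRODUCT_CODES = {"pfsr": "PF-SR", "pfqa": "PF-QA", "s2": "S2-SR"}

def build_asset_href(sample_id, product, basename):
    """Build an asset URL from a sample ID, product key, and date-derived basename.

    The UTM zone and tile offset are the first two underscore-separated
    components of the sample ID itself, so the ID alone is sufficient.
    """
    utm_zone, tile_offset = sample_id.split("_")[:2]
    return f"{PREFIX}/{utm_zone}/{tile_offset}/{sample_id}/{PRODUCT_CODES[product]}/{basename}.tif"

print(build_asset_href("30N_12E-191N_01_19", "pfsr", "2018-01-03"))
# https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/30N/12E-191N/30N_12E-191N_01_19/PF-SR/2018-01-03.tif
```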
2 Planet Fusion quality assurance files are also stored with the imagery prefix because they are formatted like images, in that they contain pixel-level information stored in the GeoTIFF format.
3 The tiling grid used in the corpus is the Planet Fusion processing grid, which is composed of 24 x 24km tiles identified by their offset (easting and northing) within the UTM zone.