RapidAI4EO Corpus Tutorial¶

Document version: 1.0

Contents¶

  1. Introduction
  2. Downloading the corpus in its entirety
  3. Querying with STAC
    1. Spatial queries
    2. Filtering Assets
    3. Querying related Items
  4. Optimized queries
    1. Spatio-temporal queries
    2. Labels queries
  5. Downloading subsets with azcopy
  6. Conclusion
  7. Appendix: Building asset URLs

1. Introduction¶

The RapidAI4EO corpus is a large dataset of analysis-ready satellite time series imagery, originally designed for land use and land cover (LULC) classification and change detection but applicable to a wide range of research topics in remote sensing and machine learning. The corpus is sampled at 500,000 locations across Europe; each location has a time series of coincident Planet Fusion and Sentinel-2 imagery as well as multiclass LULC labels derived from the 2018 CORINE Land Cover (CLC) inventory. For more details, refer to the corpus documentation.

This tutorial covers some basic aspects of accessing the corpus. Altogether, it contains 79 million image chips in addition to metadata, labels, and other resources. This comes to some 17TB of data. As such, the main goal of the tutorial is to provide patterns for efficiently querying and downloading subsets of the data relevant to researchers who do not require the corpus in its entirety.

Three types of queries, which can also be combined, will be covered here:

  1. Spatial queries: I want to find all assets that intersect with a geometry.
  2. Temporal queries: I am only interested in part of the image time series.
  3. Label queries: I am only interested in images that have a certain CLC label.

In Section 3 Querying with STAC we demonstrate the first two types of queries using the SpatioTemporal Asset Catalog (STAC) standard. As described in further detail there, the corpus has a lot of redundant metadata that can make some STAC queries rather slow. In Section 4 Optimized queries we will look at querying the corpus using files that store the spatial metadata and labels in a centralized, non-redundant format. In both sections we identify the files we want to download by their URLs. In Section 5 Downloading subsets with azcopy we look at one pattern for downloading from such a list of selected files.

Before we begin, we can install the dependencies of the notebook by uncommenting and running the following command:

In [1]:
# !pip install geopandas pandas pystac shapely

2. Downloading the corpus in its entirety¶

Caution: As detailed in the introduction, the corpus is voluminous. If you are only interested in subsets of the data, jump to the following sections.

For large downloads, it is recommended to use AzCopy. The entire corpus can be transferred using the following command:

azcopy copy --recursive 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/' $DESTINATION

3. Querying with STAC¶

The corpus is released with an accompanying static SpatioTemporal Asset Catalog (STAC). STAC is a flexible and provider-agnostic standard for indexing Earth observation imagery, and it comes with a rich ecosystem of tools for accessing data. In this tutorial we will query data with pystac; however, additional tools are also available, including tools for languages other than Python and software plugins. Users can browse and familiarize themselves with the RapidAI4EO STAC using Radiant's STAC Browser user interface. A more detailed description of the RapidAI4EO STAC structure can be found in the "Structure" section of the corpus documentation. A basic level of familiarity with the STAC specification is recommended before proceeding with this section.

Before we jump in, it bears mentioning that our corpus and therefore its STAC contain a lot of redundant metadata. At each of the 500,000 locations in the corpus, we have coincident imagery products (Planet Fusion and Sentinel-2) as well as CLC multiclass labels. We also know that the time series of like products have the same timesteps at all locations. If we aren't careful about how we query the STAC, this will lead to a lot of redundant spatio-temporal comparisons. In the section Optimized queries we use some of this knowledge to find the Assets we're looking for more efficiently.

The additional latency of retrieving STAC files via HTTPS can be reduced by first downloading the STAC and querying the local copy; be aware, however, that the STAC itself is also quite voluminous (approximately 38GB).

In [2]:
import pandas as pd
from pystac import Catalog, Item
import shapely.geometry
In [3]:
catalog_uri = (
    "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/stac-v1.0/catalog.json"
)
catalog = Catalog.from_file(catalog_uri)
In [4]:
for collection in catalog.get_collections():
    print(collection.id)
rapidai4eo_v1_source_pf
rapidai4eo_v1_source_s2
rapidai4eo_v1_labels

When we list the Collections in our Catalog, we see that there is one for each type of product: Planet Fusion imagery (shorthand: pf), Sentinel-2 imagery (s2), and the CLC multiclass labels (labels).

3.A. Spatial queries¶

Now, let's write a function that crawls a STAC Collection and returns all Items intersecting a given geometry. We will search a single Collection, rather than all Collections in the Catalog, because we know that we have matching products at each location. Running the same spatial query over multiple Collections would be redundant, whereas we can use the links in the STAC Items to find the matching Items for other products at the same location. Note that STAC always uses EPSG:4326 geometries.

In [5]:
def items_intersecting_geometry(collection, geometry):
    """Recursively find all STAC items intersecting a geometry.

    Our STAC structure (further detailed in the corpus documentation) has a hierarchy of
    Collections to speed up spatial queries. Recursively search these Collections, and
    return all STAC Items that intersect our geometry.
    """
    intersecting_items = []
    collection_bboxes = [
        shapely.geometry.box(*bounds) for bounds in collection.extent.spatial.bboxes
    ]
    if any(bbox.intersects(geometry) for bbox in collection_bboxes):

        # Collect all matching items in this collection
        for item in collection.get_items():
            item_geometry = shapely.geometry.shape(item.geometry)
            if item_geometry.intersects(geometry):
                intersecting_items.append(item)

        # Recursively search our nested collections for items
        for subcollection in collection.get_collections():
            intersecting_items += items_intersecting_geometry(subcollection, geometry)

    return intersecting_items

For our example query, we will use a rough bounding box around Berlin as our search geometry and we will query the Planet Fusion STAC Collection.

In [6]:
berlin_bbox = shapely.geometry.box(13.05, 52.35, 13.72, 52.69)
pf_collection = catalog.get_child("rapidai4eo_v1_source_pf")
In [7]:
%%time
berlin_pf_items = items_intersecting_geometry(pf_collection, berlin_bbox)
print(f"Found {len(berlin_pf_items)} Items in the vicinity of Berlin.")
Found 198 Items in the vicinity of Berlin.
CPU times: user 30.3 s, sys: 2.08 s, total: 32.4 s
Wall time: 2min 58s

The search takes nearly three minutes, despite querying against a small bounding box. Crawling over the STAC like this is not recommended for large geometries.

Our spatial query has returned 198 Planet Fusion Items. Within our STAC structure, an Item represents all timesteps for one product at one location. Within the Item, the actual files are referenced as Assets.

Let's take a look at a few Assets to get a feel for them.

In [8]:
example_pf_item = berlin_pf_items[0]
example_pf_assets = example_pf_item.get_assets()
n_head = 5

for i, (asset_name, asset) in enumerate(example_pf_assets.items()):
    print(f"Asset: {asset_name}:")
    print(asset.to_dict())
    print("---")

    if i >= n_head - 1:
        break
Asset: sr_2018-01-03:
{'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-01-03.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-01-03T10:30:00Z', 'roles': ['data']}
---
Asset: qa_2018-01-03:
{'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-QA/2018-01-03.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-01-03T10:30:00Z', 'roles': ['metadata']}
---
Asset: sr_2018-01-08:
{'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-01-08.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-01-08T10:30:00Z', 'roles': ['data']}
---
Asset: qa_2018-01-08:
{'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-QA/2018-01-08.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-01-08T10:30:00Z', 'roles': ['metadata']}
---
Asset: sr_2018-01-13:
{'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-01-13.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-01-13T10:30:00Z', 'roles': ['data']}
---

3.B. Filtering Assets¶

Assets are identified by a key that is unique per Item. Identifying the timestep to which an Asset belongs can be achieved by looking at the Asset's datetime property, in the case of pystac by accessing asset.extra_fields['datetime'].

An Asset also has roles, in the case of this corpus either "data" or "metadata". "Data" Assets reference the image chips themselves, whereas Planet Fusion also has pixelwise quality assurance masks linked as "metadata" assets.1 Note that the Asset keys were also designed to provide information on the type of file ("sr" for surface reflectance image chip or "qa" for quality assurance masks) and timestep. This allows us to quickly select the Assets we are interested in if we know these criteria.
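Since the Asset keys follow a simple "&lt;type&gt;_&lt;date&gt;" convention, both pieces of information can be recovered from the key alone, without touching the Asset's properties. As a minimal sketch (parse_asset_key is a hypothetical helper, not part of the corpus tooling):

```python
def parse_asset_key(key):
    """Split an Asset key like "sr_2018-01-03" into its file type and timestep."""
    file_type, _, timestep = key.partition("_")
    return file_type, timestep

print(parse_asset_key("sr_2018-01-03"))  # ('sr', '2018-01-03')
print(parse_asset_key("qa_2018-01-08"))  # ('qa', '2018-01-08')
```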

Each Asset also provides the absolute URL to download the file via the href property.

Let's look at some toy examples for filtering Assets.

1 Sentinel-2 metadata in the form of traceability manifests are stored at the tile Collection level. For more details, refer to the corpus documentation.

In [9]:
# Define the time period for filtering
start_date = pd.to_datetime("2018-05-01T00:00:00Z")
end_date = pd.to_datetime("2018-06-01T00:00:00Z")

# Filter the assets based on the defined time period
time_filtered_assets = {
    k: v
    for k, v in example_pf_assets.items()
    if start_date < pd.to_datetime(v.extra_fields["datetime"]) < end_date
}

# Print the number of assets and their names within the time period
print(f"{len(time_filtered_assets)} Assets in time period {start_date} - {end_date}:")
for asset_name, _ in time_filtered_assets.items():
    print(f"    {asset_name}.")

# Define the role for further filtering
role = "data"

# Filter the time-filtered assets based on the defined role
role_filtered_assets = {
    k: v for k, v in time_filtered_assets.items() if v.has_role(role)
}

print(f'Of which, {len(role_filtered_assets)} have the role "data".')
for asset_name, _ in role_filtered_assets.items():
    print(f"    {asset_name}.")
12 Assets in time period 2018-05-01 00:00:00+00:00 - 2018-06-01 00:00:00+00:00:
    sr_2018-05-03.
    qa_2018-05-03.
    sr_2018-05-08.
    qa_2018-05-08.
    sr_2018-05-13.
    qa_2018-05-13.
    sr_2018-05-18.
    qa_2018-05-18.
    sr_2018-05-23.
    qa_2018-05-23.
    sr_2018-05-28.
    qa_2018-05-28.
Of which, 6 have the role "data".
    sr_2018-05-03.
    sr_2018-05-08.
    sr_2018-05-13.
    sr_2018-05-18.
    sr_2018-05-23.
    sr_2018-05-28.

And we can also list the URLs of the files for download, which we could then download using azcopy (see Downloading subsets with azcopy), the Azure Storage SDK, or via HTTPS calls.

In [10]:
print("URLs of filtered Assets:")
for asset_name, asset in role_filtered_assets.items():
    print(asset.href)
URLs of filtered Assets:
https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-03.tif
https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-08.tif
https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-13.tif
https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-18.tif
https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-23.tif
https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/PF-SR/2018-05-28.tif

3.C. Querying related Items¶

So far our examples have just iterated over a single example Item. We could iterate over the rest of our Items, running the same or other queries, to generate a list of all relevant Assets in our original spatial query. If we want to reuse the same temporal and role filtering, we can also rely on the knowledge that each location has the same timesteps and retrieve the corresponding Assets by reusing the keys our first temporal and role query had returned.

In [11]:
query_pf_hrefs = []
pf_asset_query_keys = [*role_filtered_assets.keys()]
for item in berlin_pf_items:
    for k in pf_asset_query_keys:
        query_pf_hrefs.append(item.assets[k].href)
print(f"Found {len(query_pf_hrefs):,} matching URLs.")
Found 1,188 matching URLs.

We have been filtering within the Planet Fusion Collection only, but if we can also tie together the corresponding Sentinel-2 images and CLC labels by using the links within the Planet Fusion Items, specifically the links with relation type "related". Let's take a first look at our single example_item again to see how these relations work.

In [12]:
related_items = [
    link.resolve_stac_object().target
    for link in example_pf_item.links
    if link.rel == "related"
]

for item in related_items:
    print(f"Found related item in Collection {item.collection_id}.")
Found related item in Collection rapidai4eo_v1_labels.
Found related item in Collection rapidai4eo_v1_source_s2.

When looking at the related Items there a few ways of getting information about what the Item represents (such as item.properties['title']), but the one that will be most convenient to filter by, in the case we only want one or the other product type, is item.collection_id. As we know that there is a unique Collection per product type, we could filter the related Items by those that belong to a specific collection. We've looked at Planet Fusion Assets, now let's also inspect Sentinel-2 and CLC labels examples.

In [13]:
for item in related_items:
    print(f"Found related item in Collection {item.collection_id}.")

    for i, (asset_name, asset) in enumerate(item.assets.items()):
        print(f"Asset: {asset_name}:")
        print(asset.to_dict())
        print("---")

        if i >= n_head - 1:
            break

    print("---")
Found related item in Collection rapidai4eo_v1_labels.
Asset: labels:
{'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/labels/33N/16E-243N/labels_33N_16E-243N_29_01.geojson', 'type': 'application/geo+json', 'roles': ['labels', 'labels-vector']}
---
---
Found related item in Collection rapidai4eo_v1_source_s2.
Asset: s2_2018-01:
{'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/S2-SR/2018-01.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-01-15T10:30:00Z', 'roles': ['data']}
---
Asset: s2_2018-02:
{'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/S2-SR/2018-02.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-02-15T10:30:00Z', 'roles': ['data']}
---
Asset: s2_2018-03:
{'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/S2-SR/2018-03.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-03-15T10:30:00Z', 'roles': ['data']}
---
Asset: s2_2018-04:
{'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/S2-SR/2018-04.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-04-15T10:30:00Z', 'roles': ['data']}
---
Asset: s2_2018-05:
{'href': 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/33N/16E-243N/33N_16E-243N_29_01/S2-SR/2018-05.tif', 'type': 'image/tiff; application=geotiff', 'datetime': '2018-05-15T10:30:00Z', 'roles': ['data']}
---
---

We see here that the corresponding labels Item has a single Asset pointing to a GeoJSON file, which will contain the CLC multiclass labels. The Sentinel-2 Item has one Asset for each of the monthly image mosiacs. Tying together some of the steps above, let's look at filtering all Planet Fusion and Sentinel-2 Assets.

In [14]:
query_pfs2_hrefs = []
pf_asset_query_keys = [*role_filtered_assets.keys()]
s2_asset_query_keys = None

for pf_item in berlin_pf_items:

    # Filter Planet Fusion Assets
    for k in pf_asset_query_keys:
        query_pfs2_hrefs.append(pf_item.assets[k].href)

    # Filter Sentinel-2 Assets related to that Planet Fusion Item (i.e. is sampled at the same location)
    related_items = [
        link.resolve_stac_object().target
        for link in example_pf_item.links
        if link.rel == "related"
    ]

    for related_item in related_items:

        # Filter down to s2 items only (exclude labels for this example)
        if related_item.collection_id == "rapidai4eo_v1_source_s2":

            if s2_asset_query_keys is None:
                # This is our first look at s2 Assets, so the keys we want to access have not
                # yet been defined
                s2_asset_query_keys = [
                    k
                    for k, v in related_item.assets.items()
                    if start_date
                    < pd.to_datetime(v.extra_fields["datetime"])
                    < end_date
                ]
                # NOTE "role" filtering to only "data" is not necessary in the case of Sentinel-2
                # as no metadata files are linked to the STAC Items.

            for k in s2_asset_query_keys:
                query_pfs2_hrefs.append(related_item.assets[k].href)

print(f"Found {len(query_pfs2_hrefs):,} matching URLs.")
Found 1,386 matching URLs.

We have seen some approaches to querying the STAC and applied some of our knowledge about redundant metadata in the corpus to avoid redundantly accessing remote STAC files. The patterns applied here can be adapted to filter the data by any spatio-temporal criteria. Crawling a static STAC can be slow, and our workarounds to avoid redundant filtering are not always straightforward. In the next section we will look at using some resource files to run similar queries in a faster and more straightforward manner.

4. Optimized queries¶

As mentioned above we can leverage our knowledge about the redundant nature of metadata in the corpus to improve query times, albeit by working somewhat outside the STAC standard. To do this we will use three files that give us an overview of the metadata in the corpus:

  1. rapidai4eo_geometries.geojson.gz, containing the geometries of all sample locations;
  2. rapidai4eo_labels.csv.gz, containing the CLC multiclass labels at all sample locations; and
  3. rapidai4eo_label_mappings.csv, containing a mapping between the three levels of the CLC class hierarchy.

All three of these files should be downloaded for the below examples to run (see the second following code block to download these via cURL). If not downloading to the same directory as this notebook, or if the files are renamed, update the variables in the cell below.

In [15]:
import gzip

import geopandas as gpd
import pandas as pd
import shapely.geometry

geometries_file = "rapidai4eo_geometries.geojson.gz"
labels_file = "rapidai4eo_labels.csv.gz"
labels_mapping_file = "rapidai4eo_label_mappings.csv"

geometries_file_url = "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/resources/rapidai4eo_geometries.geojson.gz"
labels_file_url = "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/resources/rapidai4eo_labels.csv.gz"
labels_mapping_file_url = "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/resources/rapidai4eo_label_mappings.csv"

Optionally, download the resource files via cURL.

In [16]:
!curl {geometries_file_url} -o {geometries_file}
!curl {labels_file_url} -o {labels_file}
!curl {labels_mapping_file_url} -o {labels_mapping_file}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15.0M  100 15.0M    0     0  20.3M      0 --:--:-- --:--:-- --:--:-- 20.3M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8520k  100 8520k    0     0  12.2M      0 --:--:-- --:--:-- --:--:-- 12.2M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1759  100  1759    0     0  13126      0 --:--:-- --:--:-- --:--:-- 13029
In [17]:
%%time
with gzip.open(geometries_file) as fs:
    geometries = gpd.read_file(fs).set_index("sample_id")
labels = pd.read_csv(labels_file).set_index("sample_id")
labels_mapping = pd.read_csv(labels_mapping_file).set_index("index")

print("Geometries:")
print(geometries.head())
print("---")
print("Labels (columns truncated):")
print(labels[labels.columns[:6]].head())
print("---")
Geometries:
                                                             geometry
sample_id                                                            
35N_19E-292N_23_31  POLYGON ((26.49348 63.29178, 26.49338 63.29717...
35N_19E-292N_01_05  POLYGON ((26.17905 63.40880, 26.17889 63.41418...
35N_19E-292N_17_03  POLYGON ((26.15755 63.32251, 26.15739 63.32789...
35N_19E-292N_05_11  POLYGON ((26.25168 63.38765, 26.25154 63.39304...
35N_19E-292N_36_28  POLYGON ((26.45890 63.22165, 26.45880 63.22704...
---
Labels (columns truncated):
                    clc_111   clc_112  clc_121  clc_122  clc_123  clc_124
sample_id                                                                
31N_20E-211N_16_38      0.0  0.000000      0.0      0.0      0.0      0.0
35N_24E-173N_23_37      0.0  0.269575      0.0      0.0      0.0      0.0
32N_20E-205N_28_20      0.0  0.254225      0.0      0.0      0.0      0.0
33N_24E-237N_23_27      0.0  0.000000      0.0      0.0      0.0      0.0
33N_27E-224N_33_23      0.0  0.000000      0.0      0.0      0.0      0.0
---
CPU times: user 29.5 s, sys: 421 ms, total: 30 s
Wall time: 30 s

Each sample location within the corpus has a unique ID, which is integral to building the IDs of Items within the STAC and the URLs of the Assets. For example, the sample location with ID 35N_19E-292N_23_31 will correspond to a Planet Fusion Item (rapidai4eo_v1_source_pf_35N_19E-292N_23_31), a Sentinel-2 Item (rapidai4eo_v1_source_s2_35N_19E-292N_23_31), and a label Item (rapidai4eo_v1_labels_35N_19E-292N_23_31). Similarly, given a sample ID, a product type, and—in the case of our image products—a valid date we can build the URLs to any Asset in the corpus, as described in the Appendix. We will rely on these files to filter by geometries or CLC class and apply our knowledge of the timesteps available in the corpus to quickly query Assets.

4.A. Spatio-temporal queries¶

Let us begin by running the same query over Berlin we had run above.

In [18]:
%%time
berlin_bbox = shapely.geometry.box(13.05, 52.35, 13.72, 52.69)
spatially_filtered_ids = geometries[geometries.geometry.intersects(berlin_bbox)].index
print(f"{spatially_filtered_ids.size} IDs found in the vicinity of Berlin.")
print("---")
198 IDs found in the vicinity of Berlin.
---
CPU times: user 123 ms, sys: 12 ms, total: 135 ms
Wall time: 134 ms

As we can see, the spatial query returned in a fraction of the time it takes to crawl a STAC Collection. When querying against larger geometries, say, an entire country, this will make a difference from hours to minutes.

What we've identified so far are the IDs of locations where we have Assets we will want to download. The next step is to build the URLs to those assets so that we can download them. To achieve this, we'll lean on the get_asset_hrefs function in the rapidai4eo.py file provided with this tutorial (download that file to the same directory as the notebook if you have not already). This class takes the busywork out of transforming a list of IDs into a list of hrefs. If we just pass it a list of IDs, it will return URLs for all products (Planet Fusion imagery, Sentinel-2 imagery, and Planet Fusion quality assurance masks) and all available timesteps at those locations. We can optionally pass a list of products or a start and end date as a tuple to filter by those criteria.

In [19]:
from rapidai4eo import get_asset_hrefs, PRODUCTS

print(PRODUCTS)
['pfsr', 'pfqa', 's2']

We see here that we have the following product keys: 'pfsr' for Planet Fusion imagery, 'pfqa' for Planet Fusion quality assurance masks, and 's2' for Sentinel-2 imagery.

In [20]:
hrefs = get_asset_hrefs(spatially_filtered_ids)
print(
    f"{len(hrefs):,} image chips and quality assurance masks in our area of interest."
)

hrefs = get_asset_hrefs(spatially_filtered_ids, products=["pfsr", "s2"])
print(
    f"{len(hrefs):,} image chips in our area of interest, ignoring quality assurance masks"
)

hrefs = get_asset_hrefs(spatially_filtered_ids, products="s2")
print(f"{len(hrefs):,} Sentinel-2 image chips only in our area of interest.")

start_date = pd.to_datetime("2018-01-01T00:00:00Z")
end_date = pd.to_datetime("2019-01-01T00:00:00Z")
hrefs = get_asset_hrefs(
    spatially_filtered_ids,
    products=["pfsr", "s2"],
    temporal_filter=(start_date, end_date),
)
print(
    f"{len(hrefs):,} Planet Fusion and Sentinel-2 image chips in our area of interest and "
    "filtering by the time period for which both products are available (2018)."
)
60,192 image chips and quality assurance masks in our area of interest.
31,284 image chips in our area of interest, ignoring quality assurance masks
2,376 Sentinel-2 image chips only in our area of interest.
16,830 Planet Fusion and Sentinel-2 image chips in our area of interest and filtering by the time period for which both products are available (2018).

4.B. Labels queries¶

In the above examples we applied a spatial query to identify the relevant sample IDs, but if we are interested in certain CLC classes we could also identify sample IDs from the labels file. First let's take a look at our labels mapping DataFrame, which allows us to map from the names of the classes ("class_name") to the corresponding column in our labels DataFrame (via "index").

In [21]:
labels_mapping
Out[21]:
clc_level1_id clc_level2_id clc_level3_id class_name
index
clc_111 1 11 111 Continuous urban fabric
clc_112 1 11 112 Discontinuous urban fabric
clc_121 1 12 121 Industrial or commercial units
clc_122 1 12 122 Road and rail networks and associated land
clc_123 1 12 123 Port areas
clc_124 1 12 124 Airports
clc_131 1 13 131 Mineral extraction sites
clc_132 1 13 132 Dump sites
clc_133 1 13 133 Construction sites
clc_141 1 14 141 Green urban areas
clc_142 1 14 142 Sport and leisure facilities
clc_211 2 21 211 Non-irrigated arable land
clc_212 2 21 212 Permanently irrigated land
clc_213 2 21 213 Rice fields
clc_221 2 22 221 Vineyards
clc_222 2 22 222 Fruit trees and berry plantations
clc_223 2 22 223 Olive groves
clc_231 2 23 231 Pastures
clc_241 2 24 241 Annual crops associated with permanent crops
clc_242 2 24 242 Complex cultivation patterns
clc_243 2 24 243 Land principally occupied by agriculture with ...
clc_244 2 24 244 Agro-forestry areas
clc_311 3 31 311 Broad-leaved forest
clc_312 3 31 312 Coniferous forest
clc_313 3 31 313 Mixed forest
clc_321 3 32 321 Natural grasslands
clc_322 3 32 322 Moors and heathland
clc_323 3 32 323 Sclerophyllous vegetation
clc_324 3 32 324 Transitional woodland-shrub
clc_331 3 33 331 Beaches dunes sands
clc_332 3 33 332 Bare rocks
clc_333 3 33 333 Sparsely vegetated areas
clc_334 3 33 334 Burnt areas
clc_335 3 33 335 Glaciers and perpetual snow
clc_411 4 41 411 Inland marshes
clc_412 4 41 412 Peat bogs
clc_421 4 42 421 Salt marshes
clc_422 4 42 422 Salines
clc_423 4 42 423 Intertidal flats
clc_511 5 51 511 Water courses
clc_512 5 51 512 Water bodies
clc_521 5 52 521 Coastal lagoons
clc_522 5 52 522 Estuaries
clc_523 5 52 523 Sea and ocean
clc_999 9 99 999 NODATA

Let's say we are interested only in peat bogs for some research topic. Looking at the mappings, we can see that the portion of a sample locations covered by peat bog according to the 2018 CLC will be represented in our labels DataFrame by the column "clc_412". If we want to find all sample locations that are at least 80% covered by peat bog, we filter that column by values greater than 0.8.

In [22]:
peat_bog_samples = labels[labels["clc_412"] > 0.8].index
print(f"{peat_bog_samples.size:,}")
2,753

Now we can select the image chip URLs that we want to download in the same pattern we had used above, for example:

In [23]:
hrefs = get_asset_hrefs(peat_bog_samples, products="pfsr")
print(f"{len(hrefs):,} Planet Fusion image chips over peat bogs.")
401,938 Planet Fusion image chips over peat bogs.

Let's tie this together with a final example. For this case we will search for all image chips within Germany that are entirely covered by forest, also applying a temporal filter to restrict our results to the time period for which we have both image types (2018). We will read the geometry for Germany from Eruostat's NUTS dataset. In the case of forest classes, there are three level-3 entries in the CLC ontology: coniferous, deciduous, and mixed. For the labels query, we will combine these three classes by summation.

In [24]:
eu_countries_file = (
    "https://gisco-services.ec.europa.eu/distribution/v2/nuts/topojson/"
    "NUTS_RG_20M_2021_4326_LEVL_0.json"
)
eu_countries = gpd.read_file(eu_countries_file).set_index("id")
germany_geometry = eu_countries.loc["DE", "geometry"]
In [25]:
%%time
# Filter by our country geometry
germany_samples = geometries[geometries.geometry.within(germany_geometry)].index

# At level-3, the CLC has three forest classes which all belong
# to CLC level-2 class 31.
forest_classes = labels_mapping[labels_mapping.clc_level2_id == 31].index.values

# Filter all available labels by our above spatial filter, keeping
# only relevant (forest class) columns
germany_labels = labels.loc[germany_samples][forest_classes]
# And now we can get the proportion of each location covered by any forest
# class simply by summing across columns.
germany_forest_label = germany_labels.sum(axis=1)

# Let's say we now want only those locations entirely covered by forest
germany_forest_samples = germany_forest_label[germany_forest_label == 1.0].index
print(
    f"Found {germany_forest_samples.size:,} samples in Germany entirely covered by forest (CLC class 31)."
)

# And finally we can select the URLs to download any product types we want
# Let's say again we want both image types, without the QA files, for the
# calendar year 2019.
start_date = pd.to_datetime("2018-01-01T00:00:00Z")
end_date = pd.to_datetime("2019-01-01T00:00:00Z")
hrefs = get_asset_hrefs(
    germany_forest_samples,
    products=["pfsr", "s2"],
    temporal_filter=(start_date, end_date),
)
print(
    f"{len(hrefs):,} Planet Fusion and Sentinel-2 image chips in our area of interest and "
    "filtering by the time period for which both products are available (2018)."
)
Found 3,214 samples in Germany entirely covered by forest (CLC class 31).
273,190 Planet Fusion and Sentinel-2 image chips in our area of interest and filtering by the time period for which both products are available (2018).
CPU times: user 4.55 s, sys: 15 ms, total: 4.56 s
Wall time: 4.56 s

For this example query we now have a list of 273,190 image Assets, representing the time series for both Sentinel-2 and Planet Fusion at 3,214 locations across Germany covered entirely by forest. The same query run against the static STAC would have been prohibitively slow, but here it runs in about five seconds.

As mentioned above, now that we have a list of URLs we can download the matching files at will. Two approaches not covered in this tutorial are scripting the downloads with the Azure Storage SDK or with python-requests. In the next section we look at using AzCopy to download files from a list.
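As a rough sketch of the scripted approach, the helper below downloads one asset while mirroring the blob's directory structure locally (to avoid file name collisions). It uses only the standard library; the names `local_path` and `download` are illustrative, not part of the corpus utilities.

```python
# Illustrative sketch only: download corpus assets from a list of URLs,
# mirroring the blob's subdirectory layout under a local destination.
from pathlib import Path
from urllib.request import urlretrieve

PREFIX = "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/"

def local_path(href: str, dest: str = "downloads") -> Path:
    """Map an asset URL to a local path, preserving subdirectories."""
    return Path(dest) / href.removeprefix(PREFIX)

def download(href: str, dest: str = "downloads") -> Path:
    """Fetch a single asset; callers can loop or parallelize over many hrefs."""
    target = local_path(href, dest)
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(href, target)  # network call; swap in requests or the SDK as preferred
    return target
```

The same path-mapping logic applies regardless of which HTTP client does the actual transfer.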

5. Downloading subsets with azcopy¶

If we save a list of file URLs we want to download, as identified in the patterns from the above examples, we can pass those to an azcopy command.

azcopy copy 'https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/' my/destination/directory/ --list-of-files="files_of_interest.txt" --include-directory-stub

where

  • --list-of-files is where we pass the text file in which we've saved the file URLs we want to download; and
  • the --include-directory-stub flag creates subdirectories in our download destination to match those in our source, which prevents file name collisions.
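A small, hedged sketch of preparing files_of_interest.txt from a list of hrefs follows; the function names are illustrative. Note that, per the AzCopy wiki, the list entries should be paths relative to the source URL rather than full URLs, so the sketch strips the corpus prefix first.

```python
# Hedged sketch: write asset hrefs to a text file for azcopy --list-of-files.
# Entries are made relative to the azcopy source URL by stripping the prefix.
PREFIX = "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/"

def to_relative(hrefs):
    """Strip the corpus prefix so entries are relative to the azcopy source."""
    return [href.removeprefix(PREFIX) for href in hrefs]

def write_file_list(hrefs, path="files_of_interest.txt"):
    """Write one relative path per line, as azcopy expects."""
    with open(path, "w") as f:
        for entry in to_relative(hrefs):
            f.write(entry + "\n")
```

With the file written, the azcopy command shown above can be run as-is.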

Be aware that the AzCopy wiki makes the following statement about the --list-of-files flag:

As of version 10.3, this flag remains undocumented (except for this wiki page) and it does not appear in the in-app command line help. It is however regularly used by Azure Storage Explorer to pass file lists to its embedded copy of AzCopy.

(The reason that it's undocumented is that, at high file counts, it has slow performance for downloads and service-to-service copies. We want to fix that before we document it more officially).

(retrieved 2023-05-10)

6. Conclusion¶

We have seen some patterns to browse, filter, and download RapidAI4EO corpus data. Browsing via the STAC is supported, but can be slow for larger queries because the corpus has a lot of redundant metadata. Basic spatio-temporal and labels-based queries can be optimized by using two files containing consolidated metadata. Finally, we saw how to leverage AzCopy to download a list of file URLs, enabling us to retrieve only the portions of the corpus relevant to the application at hand. The various patterns shown here can be built upon to filter by complex queries. What we have so far abstracted with the utilities in rapidai4eo.py is how to build the URLs to assets. These URLs follow a pattern that can easily be applied to build bespoke queries not supported by the utilities; more details are given in the Appendix below.


7. Appendix: Building asset URLs¶

We saw in the section on Optimized queries that the URLs to data and metadata files can be built from only a few pieces of information; however, the URL-building process itself was abstracted by our get_asset_hrefs function. In this appendix we cover the patterns present in the URLs, focusing on three file types: Planet Fusion imagery, Sentinel-2 imagery, and Planet Fusion quality assurance files. There are also per-location labels files compliant with the STAC label extension, but we will not cover them here, preferring to access label information via the consolidated labels file rapidai4eo_labels.csv.gz.

| Example | Prefix | UTM zone | Tile offset | Sample ID | Product code | Date-derived basename |
| --- | --- | --- | --- | --- | --- | --- |
| Sentinel-2 image | https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/ | 30N/ | 12E-191N/ | 30N_12E-191N_01_19/ | S2-SR/ | 2018-01.tif |
| Planet Fusion image | https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/ | 30N/ | 12E-191N/ | 30N_12E-191N_01_19/ | PF-SR/ | 2018-01-03.tif |
| Planet Fusion QA | https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery/ | 30N/ | 12E-191N/ | 30N_12E-191N_01_19/ | PF-QA/ | 2018-01-03.tif |

The paths to all of these file types begin with https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery.[2] The next three components relate to the sample location: the UTM zone in which the sample is found, the grid tile[3] in which the sample is found, and the sample ID. Note that the sample ID itself contains the UTM zone and tile offset, and is therefore sufficient for building URLs to this level. After the sample ID comes a code specifying the product type and, finally, the file basename, which is derived from the timestep the asset represents. Planet Fusion, available in the corpus at a five-day cadence, has basenames formatted as ISO date strings. Sentinel-2, available in the corpus at a monthly cadence, has the date representations truncated to the month (i.e. with formatting '%Y-%m'). Therefore the URL to any asset in the corpus can be built given a sample ID, a product type, and a timestep.
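The pattern above can be sketched as a small helper. This is illustrative only (the corpus utilities already provide get_asset_hrefs for this purpose), and the function name is our own; the sample ID, product codes, and date formats follow the table above.

```python
# Illustrative sketch of the asset URL pattern described above.
from datetime import date

BASE = "https://radiantearth.blob.core.windows.net/mlhub/rapidai4eo/imagery"

def build_asset_url(sample_id: str, product: str, timestep: date) -> str:
    """Build an asset URL from a sample ID, product code, and timestep.

    The sample ID (e.g. '30N_12E-191N_01_19') embeds the UTM zone and the
    tile offset, so the path components can be derived from it alone.
    """
    utm_zone, tile_offset, _, _ = sample_id.split("_")
    if product == "S2-SR":
        basename = timestep.strftime("%Y-%m") + ".tif"  # monthly cadence
    else:  # PF-SR or PF-QA
        basename = timestep.isoformat() + ".tif"  # five-day cadence
    return f"{BASE}/{utm_zone}/{tile_offset}/{sample_id}/{product}/{basename}"
```

For example, build_asset_url("30N_12E-191N_01_19", "PF-SR", date(2018, 1, 3)) reproduces the Planet Fusion row of the table.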

[2] Planet Fusion quality assurance files are also stored under the imagery prefix because they are formatted like images, in that they contain pixel-level information stored in the GeoTIFF format.

[3] The tiling grid used in the corpus is the Planet Fusion processing grid, which is composed of 24 × 24 km tiles identified by their offset (easting and northing) within the UTM zone.