When cloud-native geospatial meets GPU-native machine learning

AI-generated image from Midjourney v5.2 using prompt 'Rocket launching straight through the clouds of the Earth's atmosphere at a 45 degree angle with orange flames from its booster --aspect 7:4' with seed 1342788268

Behind every great Geospatial Foundation Model is a stack of great open source tools.

At FOSS4G SotM Oceania 2023, I gave a talk on the Pangeo Machine Learning Ecosystem (see the recording here). One of the reasons the Pangeo community aligns well with folks at Development Seed is that we like to be forward-looking: eager to use Earth Observation data to understand past patterns, while preparing for what's to come with future Climate/Weather projections and their impact on Earth in the coming decades.

But are we ready for Petabyte scale?

The launch of the SWOT satellite altimeter in 2022 and the upcoming NISAR mission will produce Terabytes of data each day. A paradigm shift is required at this scale, not only on the technical side in terms of data storage and compute, but also on the way people collaborate on building next-generation tools for open science.

Let's level up our game by:

  1. Going GPU-native with kvikIO
  2. Streaming subsets on-the-fly with xbatcher
  3. Building composable geospatial DataPipes with zen3geo

As we start to catch a glimpse of multi-modal Foundation Models over the horizon, it becomes apparent that the tools we use to train these models need to scale accordingly and be energy efficient, while remaining modular enough to be repurposed for different applications.

We'll now walk you through an example of what this looks like in practice.

Going GPU-native with kvikIO

Illustration of NVIDIA GPUDirect Storage (GDS) from an NVMe storage drive to a Graphics Processing Unit (GPU), with and without GDS. The path with GDS is shorter, passing only through a PCIe switch. The path without GDS goes through the PCIe switch, CPU, and System RAM, then back through the CPU and PCIe switch before reaching the GPU. On the right, the figure shows current (2023) support for filesystems, file formats (via cuFile), and Zarr compression (via nvCOMP).

Our data journey starts at the file storage level. Traditionally, to feed data from disk storage to a Graphics Processing Unit (GPU), the data needs to go through a PCIe switch, then to the CPU and system memory (CPU RAM) via what is called a bounce buffer, before going back through the PCIe switch and on to the GPU. Every step along this path adds latency.

In the GPU-native world, you would use a technology like NVIDIA GPUDirect Storage (GDS), wherein the data goes via the PCIe switch and directly to the GPU, bypassing the CPU altogether. This means less latency, and faster data read speeds (which I'll show some numbers for below)! As of 2023, this is supported for qualified file systems such as ext4/xfs via cuFile, with extra setup required for network filesystems.
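
To make this concrete, here is a minimal sketch of kvikIO's Python API for moving bytes between disk and GPU memory, following the usage shown in the kvikIO README (the file path is a placeholder):

```python
import cupy
import kvikio

# Write a CuPy array to disk, then read it straight back into GPU memory.
# With GDS enabled, the read bypasses the CPU bounce buffer entirely;
# without it, kvikIO transparently falls back to a regular POSIX read.
a = cupy.arange(100, dtype=cupy.float32)
f = kvikio.CuFile("/tmp/data.bin", "w")
f.write(a)
f.close()

b = cupy.empty_like(a)
f = kvikio.CuFile("/tmp/data.bin", "r")
f.read(b)  # bytes land directly in GPU memory
f.close()
```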

In terms of cloud-optimized geospatial file formats, Parquet (via cuDF) and Zarr (via kvikIO) are currently the best supported. For Zarr in particular, kvikIO has some initial support for decompressing LZ4-compressed datasets directly on the GPU (via nvCOMP), and cupy-xarray has an experimental kvikIO interface that makes it easier to read Zarr into GPU-backed xarray Datasets! This gets us closer to not having to rely on the CPU, and as more GPU-accelerated tools come online, we can leave the CPU-to-GPU bottleneck behind!
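
As a rough sketch of that experimental interface (the Zarr store path and variable name are placeholders, and the API may change), reading Zarr into a GPU-backed Dataset looks something like this:

```python
import cupy_xarray  # noqa: F401, registers the experimental "kvikio" engine
import xarray as xr

# Open a local Zarr store so that variables are backed by CuPy (GPU) arrays
# instead of NumPy (CPU) arrays
ds = xr.open_dataset("era5_subset.zarr", engine="kvikio", consolidated=False)
print(type(ds["2m_temperature"].data))  # cupy.ndarray rather than numpy.ndarray
```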

Streaming subsets on-the-fly with xbatcher

Illustration of xbatcher's features for slicing multi-dimensional arrays on-the-fly with named variables. This is useful for e.g. time-series or multi-variate oceanography / climate model outputs. The diagram on the left shows how a datacube is sliced, and the code block on the right compares regular slicing code with xbatcher. At the bottom, more features are mentioned such as lazy loading, xarray accessors, and the experimental cache mechanism, along with upcoming roadmap items like shuffling/sampling utilities and async loading of batches.

Going GPU-native allows us to read data faster, but that doesn't mean we can fit all of it into GPU memory at once! For large multi-variate and/or long time-series datasets such as the ERA5 climate reanalysis product (which I'll be showing below), we need a smarter way to handle things. Geospatial Machine Learning practitioners have often subset their datasets into smaller 'chips' stored as intermediate files, before loading them into memory. This, however, duplicates data, and when you're working at the terabyte or petabyte scale, it can be a pain to store and manage all of those intermediate files.

Why not just stream data subsets on-the-fly? When using cloud-optimized formats like Cloud-Optimized GeoTIFFs or Zarr that have been chunked properly, accessing subsets can be efficient. Libraries like xbatcher allow us to slice datacubes intuitively along any dimension using named variables, as sketched below. Xbatcher uses xarray's lazy-loading mechanism behind the scenes to save memory, and since the Xarray data model doesn't care whether the underlying data is a CPU- or GPU-backed array, you can switch from CPU to GPU-native while using the same interface!
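
For instance, here is a minimal sketch of generating chips with xbatcher (the store path, dimension names, and chip sizes are illustrative):

```python
import xarray as xr
import xbatcher

ds = xr.open_dataset("era5_subset.zarr", engine="zarr", chunks="auto")

# Lazily yield 256x256 pixel chips along the longitude/latitude dimensions,
# one time step at a time, with no intermediate files written to disk
bgen = xbatcher.BatchGenerator(
    ds, input_dims={"longitude": 256, "latitude": 256, "time": 1}
)
for batch in bgen:
    print(batch.sizes)  # e.g. {'longitude': 256, 'latitude': 256, 'time': 1}
    break
```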

Composable geospatial DataPipes with zen3geo

Illustration of zen3geo's chainable I/O readers and processors for geospatial data, designed to be ready for multi-sensor/multi-modal architectures. Flowchart in the middle row shows STAC, vector, raster, spatial and other DataPipes making up zen3geo. Bottom row shows some of the key features of zen3geo (as of v0.6.2), and a future roadmap.

All that said, building a geospatial data pipeline for GPU-native workflows still involves a lot of moving parts. From Composable Data Systems to Modular Deep Learning, the trend is to build plug-and-play pieces that are interoperable. Just as we are seeing AI systems that can combine visual, text and sound inputs, otherwise known as multi-modal models, we see the need to do the same for geospatial formats - raster, vector, point clouds, and so on.

One such library is zen3geo, which is designed for building custom multi-sensor or multi-modal data pipelines. It implements I/O readers for standards such as SpatioTemporal Asset Catalogs (STAC) that point to raster or vector file formats, and allows you to do data conversions (e.g. rasterization) or apply any custom processing function you've created. Behind the scenes, zen3geo makes extensive use of the geopandas GeoDataFrame and Xarray data models, and depends on torchdata DataPipes to chain operations together in a composable manner.

How does this work in practice? Glad you asked, because now it's time to show a proper end-to-end GPU-native example!

An example GPU-native data pipeline!

Illustration of a zen3geo DataPipe, starting with a url to an ERA5 Zarr store, which is read using kvikIO, passed through a custom pre-processing function, sliced into subsets using xbatcher, converted from CuPy arrays into Torch tensors, and finally passed into a DataLoader.
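
A sketch of this pipeline follows, reconstructed for this post; the local Zarr store name, ERA5 variable name, and chip sizes are illustrative:

```python
import cupy_xarray  # noqa: F401, registers the experimental "kvikio" engine
import torch
import xarray as xr
import zen3geo  # noqa: F401, registers the slice_with_xbatcher DataPipe
from torchdata.datapipes.iter import IterableWrapper

# 1. A pointer to the (local) ERA5 Zarr store
dp_url = IterableWrapper(iterable=["2020_era5_subset.zarr"])

# 2. Read the Zarr store into a GPU-backed xarray.Dataset via kvikIO
dp_dataset = dp_url.map(
    fn=lambda store: xr.open_dataset(store, engine="kvikio", consolidated=False)
)

# 3. Select the desired ERA5 data variable (name is illustrative)
def select_variables(ds: xr.Dataset) -> xr.DataArray:
    return ds["u_component_of_wind"]

dp_variable = dp_dataset.map(fn=select_variables)

# 4. Slice the datacube along longitude/latitude/time with xbatcher
dp_xbatcher = dp_variable.slice_with_xbatcher(
    input_dims={"longitude": 256, "latitude": 256, "time": 1}
)

# 5. Collate: zero-copy conversion from CuPy arrays to Torch tensors
def cupy_to_tensor(chip: xr.DataArray) -> torch.Tensor:
    # torch.as_tensor consumes __cuda_array_interface__,
    # so the tensor shares the underlying GPU memory (no host copy)
    return torch.as_tensor(chip.data, device="cuda")

dp_tensor = dp_xbatcher.map(fn=cupy_to_tensor)
```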

The code above shows what a GPU-native data pipeline built with zen3geo looks like for reading an ERA5 dataset from WeatherBench2, one of the climate reanalysis products used to train models that make future Climate/Weather projections. It consists of five steps:

  1. A pointer to the Zarr store
  2. Reading the Zarr store with GPUDirect Storage using cupy-xarray's kvikIO engine
  3. A custom function to select desired ERA5 data variables (e.g. wind speed)
  4. Slicing the ERA5 datacube along longitude/latitude/time dimensions with xbatcher
  5. A custom collate function to convert CuPy arrays to Torch tensors (zero-copy)

This DataPipe can then be passed into a regular PyTorch DataLoader, which then streams data in a GPU-native way to some neural network model. For more details, you can check out the code on GitHub here; otherwise, let's see some benchmark results!
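
As a usage sketch, building on the hypothetical `dp_tensor` pipe above, the hand-off to a training loop could look like:

```python
from torch.utils.data import DataLoader

# DataPipes are IterableDatasets, so they plug straight into a DataLoader;
# batching was already handled by xbatcher, hence batch_size=None
dataloader = DataLoader(dataset=dp_tensor, batch_size=None)
for tensor in dataloader:
    ...  # forward pass through your neural network model goes here
```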

Results: ~25% less time with GPUDirect Storage

Comparing GPU-based kvikIO engine with CPU-based zarr engine.

This benchmark compares the GPU-based kvikIO engine with the CPU-based zarr engine for loading an 18.2GB ERA5 subset dataset (more technical details here). We see that using GPUDirect Storage (via kvikIO) takes about 11.9 seconds, compared to about 16.0 seconds for the regular CPU-based method, or about 25% less time!
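
For reference, a minimal sketch of how such a comparison can be timed (the store path is a placeholder, and a real benchmark should average over multiple runs and clear filesystem caches between them):

```python
import time

import cupy_xarray  # noqa: F401, registers the "kvikio" engine
import xarray as xr

for engine in ("zarr", "kvikio"):
    ds = xr.open_dataset("2020_era5_subset.zarr", engine=engine, consolidated=False)
    tic = time.perf_counter()
    ds.load()  # materialize all chunks into CPU (zarr) or GPU (kvikio) memory
    toc = time.perf_counter()
    print(f"{engine}: {toc - tic:.1f} s")
```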

Let's say we want to scale up to a Terabyte of data (roughly 55 times the 18.2GB benchmark subset), and run more epochs (or 'training' iterations), say 100. Here are some back-of-the-envelope calculations:

The two methods would take this amount of time:

  1. CPU-based zarr engine: 16.0 s × 55 × 100 epochs ≈ 24.4 hours
  2. GPU-based kvikIO engine: 11.9 s × 55 × 100 epochs ≈ 18.2 hours

So we can save about 6 hours per day, or about 180 hours a month! Let's put those savings in practical terms: dollar amounts and carbon emissions.

Savings in terms of price and carbon emissions

| Time saved | Price (USD3/hr) [1] | Carbon (0.04842 kgCO₂eq/hr) [2] | CO₂ emissions equivalent to |
|---|---|---|---|
| 6 hr (day) | USD18 | 0.29 kgCO₂eq | 35 smartphone charges 📱 |
| 180 hr (month) | USD540 | 8.72 kgCO₂eq | 44 km of driving a car 🚗 |
| 2190 hr (year) | USD6570 | 106.04 kgCO₂eq | 431 km domestic flight (Taupo to Auckland) ✈️ |

Assumptions:

  1. Price: USD3/hour for an NVIDIA A100 40GB GPU (range $2.39-$4.10) from https://www.paperspace.com/gpu-cloud-comparison
  2. Carbon intensity in the Sydney region (538 gCO₂eq/kWh) from https://cloud.google.com/sustainability/region-carbon
     - 90 W power draw × 1 hour = 0.09 kWh
     - 0.09 kWh × 538 gCO₂eq/kWh = 0.04842 kgCO₂eq/hr

Towards GPU-native geospatial data science 🚀

The foundation has been laid, and a path forward is now visible: from cloud-native geospatial file formats optimized for concurrent reads, to accelerated libraries that let us scale 'The Wall', the gap between classical CPU-based systems and GPU-based AI/ML workflows. Building the next state-of-the-art model will require an intricate design that leverages the power of a combined cloud-native and GPU-native stack.

And we hope that you will join us on this journey. Keep an eye on this space!

Credits

The Pangeo community has been instrumental in pushing the frontier of open, reproducible, and scalable science in fields such as Earth Observation and Climate/Weather modelling. In particular, we'd like to acknowledge the work of Deepak Cherian at Earthmover and Negin Sobhani at NCAR on cupy-xarray/kvikIO, as well as Max Jones at Carbonplan for recent developments on the xbatcher package.


Note: A trimmed version of this blog post is published at https://developmentseed.org/blog/2024-03-19-combining-cloud-gpu-native. The version here is the full version with all the technical details spelt out.