Introduction to Xarray

Introduction to Xarray#

In a data science setting, it is common to work with datasets that are tabular by nature and therefore relatively easy to manage. But how do these ideas translate to datasets and data structures that must take specific domain knowledge into account? For example, in the earth and climate sciences it is important to manage remote sensing data, which usually comes as large multidimensional arrays: two-dimensional spatial fields that may also carry a time series and measurements in more than one spectral band.

Resources:

Acknowledgment: A large part of the content in this notebook was created by Dr. Chelle Gentemann.

1. Motivation#

Every data library makes assumptions about how data is structured. For example, the two basic kinds of data structures we work with in Python are:

Lists, matrices, multidimensional arrays: These are very common structures in the physical sciences. The main library for manipulating these tensor-like structures in Python is numpy.
Tabular data: Here the assumption is that rows are observations and columns are features. These are natural data structures that we often find in data science projects, and the main library for them is pandas. However, they don’t cover all the types of data structures we want to work with.

Note

The pandas library is built on top of numpy. However, the conceptual setup and the way users manipulate data with the two libraries are radically different.
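As a small illustration of this note, the sketch below uses a made-up table (the station names, column names and values are hypothetical). It shows that a DataFrame can hand back its values as a numpy array, while the usual way of addressing an entry changes from integer positions to labels:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: two observations (rows) of two features (columns)
df = pd.DataFrame(
    {"temperature": [280.5, 281.0], "pressure": [1013.0, 1009.5]},
    index=["station_a", "station_b"],
)

# The values can be exported as a plain numpy array
values = df.to_numpy()
print(type(values))

# numpy: positional access (row 1, column 0)
print(values[1, 0])

# pandas: label-based access to the same entry
print(df.loc["station_b", "temperature"])
```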

However, when we start dealing with multidimensional data (e.g., three-dimensional data involving latitude, longitude and time), we start running into problems, including:

How do we keep track of which variables are our coordinates? Latitude, longitude and time are quite special physical quantities. If we just work with numpy arrays, we have no way of knowing which dimension corresponds to which coordinate.
How do we store multiple datasets that use the same coordinate system? Even worse, how do we keep track of datasets with different dimensions? For example, for a dataset that collects temperature measurements, we can imagine using a three-dimensional array (lat, lon, time). However, we may also want to include a dataset of surface elevations, for which time is irrelevant, so the array only has dimensions (lat, lon).
How can we attach information about the dataset to our array? This includes the metadata, units and product specifications. Notice that numpy arrays don’t carry units.
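To make the first and last of these problems concrete, here is a minimal sketch on made-up data (the grid, values and variable name are assumptions for illustration only). It contrasts a bare numpy array, where we have to remember which axis is time, with a labeled xarray.DataArray (xarray is introduced in the next section), where dimension names, coordinates and units travel with the values:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: temperature on a 2 x 3 grid over 4 time steps
temperature = 280.0 + np.arange(24, dtype=float).reshape(2, 3, 4)

# With plain numpy we must remember that axis 2 is time
time_mean_np = temperature.mean(axis=2)

# With xarray the dimension names, coordinates and units are stored with the data
da = xr.DataArray(
    temperature,
    dims=("lat", "lon", "time"),
    coords={
        "lat": [10.0, 20.0],
        "lon": [100.0, 110.0, 120.0],
        "time": pd.date_range("2000-01-01", periods=4, freq="MS"),
    },
    attrs={"units": "K"},
    name="temperature",
)

# Reduce by dimension name instead of by axis number
time_mean_xr = da.mean(dim="time")

print(time_mean_xr.dims)  # dimension names of the result
print(da.attrs["units"])  # metadata stored next to the values
```

Both reductions compute the same numbers; the difference is that the labeled version does not depend on remembering the axis order.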

2. Xarray#

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
#import seaborn as sns
import xarray as xr

#import os
from pathlib import Path

# Small style adjustments for more readable plots
# (matplotlib >= 3.6 renamed the bundled seaborn styles to "seaborn-v0_8-<style>")
plt.style.use("seaborn-v0_8-whitegrid")
plt.rcParams["figure.figsize"] = (8, 6)
plt.rcParams["font.size"] = 14

For the purposes of this tutorial, we are going to work with the ERA5 reanalysis data product:

Atmospheric global climate reanalyses.
Hourly estimates of atmospheric, land and oceanic climate variables from 1979 to 2019.
30 km global grid with 137 vertical levels.

Note

Notice where the dataset is stored. If you are running this notebook from the Hub, you will see this is located on our shared folder.

DATA_DIR = Path.home()/Path('shared/climate-data')
monthly_2deg_path = DATA_DIR / "era5_monthly_2deg_aws_v20210920.nc"
ds = xr.open_dataset(monthly_2deg_path)

We can open the .nc (netCDF) file directly as an xarray.Dataset. An xarray.Dataset consists of a collection of objects, including:

Dimensions
Coordinates
Data Variables
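The three kinds of objects listed above can be seen in a small, self-contained sketch. The dataset below is made up (names such as toy, temperature and elevation are hypothetical and unrelated to the ERA5 file); it holds a time-dependent variable and a static one on the same grid:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical coordinates for a tiny 2 x 3 grid and 4 monthly time steps
lat = [10.0, 20.0]
lon = [100.0, 110.0, 120.0]
time = pd.date_range("2000-01-01", periods=4, freq="MS")

toy = xr.Dataset(
    data_vars={
        # Time-dependent field with dimensions (time, lat, lon)
        "temperature": (("time", "lat", "lon"), np.full((4, 2, 3), 285.0)),
        # Static field with dimensions (lat, lon) only
        "elevation": (("lat", "lon"), np.zeros((2, 3))),
    },
    coords={"time": time, "lat": lat, "lon": lon},
    attrs={"title": "toy example"},
)

print(dict(toy.sizes))      # dimension name -> length
print(list(toy.coords))     # coordinate names
print(list(toy.data_vars))  # data variable names
```

Because both variables share the lat and lon coordinates, they can live in one Dataset even though only one of them has a time dimension.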

You can visualize all these objects directly by displaying the dataset (for example, by evaluating ds in a notebook cell):

There are a few things we can observe here:

The type of each data object and its respective dimensions.
The metadata.
The attributes of each variable, which we can inspect by clicking on the icons next to it.
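The same information can also be read programmatically rather than through the interactive display. The sketch below uses a made-up stand-in dataset (its values, units and the "source" attribute are assumptions, not the contents of the ERA5 file) and prints the type, dimensions and attributes of each variable:

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for one variable of the real dataset
data = np.full((2, 3), 285.0, dtype="float32")
toy = xr.Dataset(
    {
        # (dimension names, values, variable attributes)
        "air_temperature_at_2_metres": (("latitude", "longitude"), data, {"units": "K"}),
    },
    coords={"latitude": [10.0, 20.0], "longitude": [100.0, 110.0, 120.0]},
    attrs={"source": "made-up example"},
)

# Type, dimensions and attributes of every data variable
for name, var in toy.data_vars.items():
    print(name, var.dtype, var.dims, var.attrs)

# Attributes of the dataset as a whole
print(toy.attrs)
```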

2.3. Making plots#

Making plots with xarray is extremely easy. One of the advantages of using xarray is that the plot automatically includes axis information, because every array carries names: the names of its coordinates and the name of the data variable itself.

ds.air_temperature_at_2_metres.sel(time="1979-01").plot();

[Figure: map of air_temperature_at_2_metres for January 1979]

This is just a line of code… quite impressive.
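For readers without access to the ERA5 file, the following sketch reproduces the idea on synthetic data (the grid and temperature values are made up, and the variable name is reused only for illustration). Selecting a single time step leaves a two-dimensional array, which xarray draws as a map with axis labels taken from the coordinate names:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs as a plain script; drop in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr

# Made-up temperature field on a coarse global grid with 3 monthly time steps
latitude = np.linspace(-90, 90, 19)
longitude = np.linspace(0, 350, 36)
time = pd.date_range("1979-01-01", periods=3, freq="MS")
warm_equator = np.cos(np.deg2rad(latitude))[None, :, None]
values = 250.0 + 50.0 * warm_equator * np.ones((3, 19, 36))

da = xr.DataArray(
    values,
    dims=("time", "latitude", "longitude"),
    coords={"time": time, "latitude": latitude, "longitude": longitude},
    name="air_temperature_at_2_metres",
    attrs={"units": "K"},
)

fig, ax = plt.subplots()
# One time step leaves a 2D (latitude, longitude) array, drawn as a colour mesh
mesh = da.isel(time=0).plot(ax=ax)

print(ax.get_xlabel(), "|", ax.get_ylabel())
```

Here isel(time=0) picks the first time step by position; the tutorial's sel(time="1979-01") selects by label instead.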

Also, depending on what we want to plot, xarray will work out which type of plot to make. For example, if we select a single latitude and longitude, only the time dimension remains, so xarray plots a time series:

ds.air_temperature_at_2_metres.sel(latitude=37.125, longitude=238.875).plot();

[Figure: time series of air_temperature_at_2_metres at latitude 37.125, longitude 238.875]
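The call above selects by exact coordinate values, so it relies on 37.125 and 238.875 being points of the grid. As a hedged sketch on synthetic data (the 2-degree grid and random values below are assumptions), passing method="nearest" lets sel fall back to the closest grid point; without it, a label that is not on the grid raises a KeyError:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs as a plain script; drop in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr

# Made-up monthly data on a hypothetical 2-degree grid
time = pd.date_range("1979-01-01", periods=24, freq="MS")
latitude = np.arange(-88.0, 90.0, 2.0)
longitude = np.arange(0.0, 360.0, 2.0)
rng = np.random.default_rng(0)
values = 285.0 + rng.normal(size=(time.size, latitude.size, longitude.size))

da = xr.DataArray(
    values,
    dims=("time", "latitude", "longitude"),
    coords={"time": time, "latitude": latitude, "longitude": longitude},
    name="air_temperature_at_2_metres",
)

# The requested point is not on this grid, so ask for the nearest grid cell
point = da.sel(latitude=37.125, longitude=238.875, method="nearest")

fig, ax = plt.subplots()
# Only the time dimension is left, so xarray draws a line plot
lines = point.plot(ax=ax)

print(point.dims, float(point.latitude), float(point.longitude))
```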
