{ "cells": [ { "cell_type": "markdown", "id": "10e14ca6-ce25-4372-9987-40ab5ea5d8d6", "metadata": { "user_expressions": [] }, "source": [ "## Introduction to Xarray\n", "\n", "In data science settings, it is common to work with datasets that are tabular by nature and therefore relatively easy to manage. However, how do these same ideas translate to designing datasets and data structures that take specific domain knowledge into account? For example, in the earth and climate sciences it is important to manage remote sensing data, which usually comes in the form of large multidimensional arrays: two-dimensional grids that can also carry time series data and measurements in more than one band.\n", "\n", "Resources:\n", "- [Dr. Chelle Gentemann Presentation](https://docs.google.com/presentation/d/11Tqlzbfq2rSjRJ6HoggLCFzdl-Nx8k1HWsefRWoaSGc/edit#slide=id.g215fc39eaf6_0_6)\n", "- [xarray documentation](https://docs.xarray.dev/en/stable/)\n", "\n", "**Acknowledgment:** A large part of the content in this notebook was created by [Dr. Chelle Gentemann](https://cgentemann.github.io)." ] }, { "cell_type": "markdown", "id": "9fb73f86-fe92-49cb-abc5-81d76d438a00", "metadata": { "user_expressions": [] }, "source": [ "### 1. Motivation\n", "\n", "Every data structure encodes assumptions about how the data is organized. For example, the two basic types of data structures we work with in Python are:\n", "- Lists, matrices, multidimensional arrays: These are very common structures in the physical sciences. The main library for manipulating these _tensorial_ structures in Python is `numpy`.\n", "- Tabular data: handled with `pandas`, under the assumption that rows are _observations_ and columns are _features_.\n", "\n", "These are natural data structures that we often find in data science projects. However, they don't cover all the types of data we want to work with.\n", "```{note}\n", "The library `pandas` is built on top of `numpy`. However, the conceptual setup and the way users manipulate data with the library are radically different. 
\n", "```\n", "\n", "However, when we start dealing with multidimensional data (e.g., three-dimensional data involving latitude, longitude, and time), we run into problems, including:\n", "- How do we keep track of which dimensions are our _coordinate variables_? Latitude, longitude, and time are special physical quantities. If we just work with `numpy` arrays, we have no way of knowing which dimension corresponds to which coordinate.\n", "- How do we store multiple datasets that share the same coordinate system? Even worse, how do we keep track of datasets with different dimensions? For example, for a dataset that collects temperature measurements, we can imagine using a three-dimensional array (lat, lon, time). However, we may also want to include a dataset of surface elevations, for which time is irrelevant and the array has only (lat, lon) dimensions.\n", "- How can we include information about the dataset in our array? This includes metadata, units, and product specifications. Notice that `numpy` arrays don't carry units.\n" ] }, { "cell_type": "markdown", "id": "14ac59ff-3516-4034-a0cc-21136bc35f56", "metadata": { "user_expressions": [] }, "source": [ "### 2. Xarray" ] }, { "cell_type": "code", "execution_count": 1, "id": "dbc5f389-41ff-40a4-a2fe-32d0d1348492", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/tmp/ipykernel_386/14117147.py:11: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-
<xarray.Dataset>\n", "Dimensions: (\n", " time: 504,\n", " latitude: 90,\n", " longitude: 180)\n", "Coordinates:\n", " * time (time) datetime64[ns] ...\n", " * latitude (latitude) float32 ...\n", " * longitude (longitude) float32 ...\n", "Data variables: (12/15)\n", " air_pressure_at_mean_sea_level (time, latitude, longitude) float32 ...\n", " air_temperature_at_2_metres (time, latitude, longitude) float32 ...\n", " air_temperature_at_2_metres_1hour_Maximum (time, latitude, longitude) float32 ...\n", " air_temperature_at_2_metres_1hour_Minimum (time, latitude, longitude) float32 ...\n", " dew_point_temperature_at_2_metres (time, latitude, longitude) float32 ...\n", " eastward_wind_at_100_metres (time, latitude, longitude) float32 ...\n", " ... ...\n", " northward_wind_at_100_metres (time, latitude, longitude) float32 ...\n", " northward_wind_at_10_metres (time, latitude, longitude) float32 ...\n", " precipitation_amount_1hour_Accumulation (time, latitude, longitude) float32 ...\n", " sea_surface_temperature (time, latitude, longitude) float32 ...\n", " snow_density (time, latitude, longitude) float32 ...\n", " surface_air_pressure (time, latitude, longitude) float32 ...\n", "Attributes:\n", " institution: ECMWF\n", " source: Reanalysis\n", " title: ERA5 forecasts" ], "text/plain": [ "
<xarray.DataArray 'air_temperature_at_2_metres' (time: 504, latitude: 90,\n", " longitude: 180)>\n", "[8164800 values with dtype=float32]\n", "Coordinates:\n", " * time (time) datetime64[ns] 1979-01-16T11:30:00 ... 2020-12-16T11:30:00\n", " * latitude (latitude) float32 -88.88 -86.88 -84.88 ... 85.12 87.12 89.12\n", " * longitude (longitude) float32 0.875 2.875 4.875 6.875 ... 354.9 356.9 358.9\n", "Attributes:\n", " long_name: 2 metre temperature\n", " nameCDM: 2_metre_temperature_surface\n", " nameECMWF: 2 metre temperature\n", " product_type: analysis\n", " shortNameECMWF: 2t\n", " standard_name: air_temperature\n", " units: K
<xarray.DataArray 'air_temperature_at_2_metres' (time: 504, latitude: 90,\n", " longitude: 180)>\n", "[8164800 values with dtype=float32]\n", "Coordinates:\n", " * time (time) datetime64[ns] 1979-01-16T11:30:00 ... 2020-12-16T11:30:00\n", " * latitude (latitude) float32 -88.88 -86.88 -84.88 ... 85.12 87.12 89.12\n", " * longitude (longitude) float32 0.875 2.875 4.875 6.875 ... 354.9 356.9 358.9\n", "Attributes:\n", " long_name: 2 metre temperature\n", " nameCDM: 2_metre_temperature_surface\n", " nameECMWF: 2 metre temperature\n", " product_type: analysis\n", " shortNameECMWF: 2t\n", " standard_name: air_temperature\n", " units: K
<xarray.DataArray 'air_temperature_at_2_metres' (time: 1, latitude: 1,\n", " longitude: 1)>\n", "array([[[280.93103]]], dtype=float32)\n", "Coordinates:\n", " * time (time) datetime64[ns] 1979-01-16T11:30:00\n", " * latitude (latitude) float32 37.12\n", " * longitude (longitude) float32 238.9\n", "Attributes:\n", " long_name: 2 metre temperature\n", " nameCDM: 2_metre_temperature_surface\n", " nameECMWF: 2 metre temperature\n", " product_type: analysis\n", " shortNameECMWF: 2t\n", " standard_name: air_temperature\n", " units: K
<xarray.DataArray 'air_temperature_at_2_metres' (latitude: 2, longitude: 6)>\n", "array([[280.93103, 272.38843, 271.50885, 270.77847, 268.52744, 267.5718 ],\n", " [276.909 , 268.4416 , 265.8512 , 264.58868, 265.966 , 263.49173]],\n", " dtype=float32)\n", "Coordinates:\n", " time datetime64[ns] 1979-01-16T11:30:00\n", " * latitude (latitude) float32 37.12 39.12\n", " * longitude (longitude) float32 238.9 240.9 242.9 244.9 246.9 248.9\n", "Attributes:\n", " long_name: 2 metre temperature\n", " nameCDM: 2_metre_temperature_surface\n", " nameECMWF: 2 metre temperature\n", " product_type: analysis\n", " shortNameECMWF: 2t\n", " standard_name: air_temperature\n", " units: K
<xarray.DataArray 'air_temperature_at_2_metres' (time: 1, latitude: 3,\n", " longitude: 7)>\n", "array([[[280.93103, 272.38843, 271.50885, 270.77847, 268.52744, 267.5718 ,\n", " 267.39337],\n", " [276.909 , 268.4416 , 265.8512 , 264.58868, 265.966 , 263.49173,\n", " 264.07922],\n", " [269.05563, 266.26437, 265.55194, 263.19815, 264.29312, 261.77335,\n", " 259.54166]]], dtype=float32)\n", "Coordinates:\n", " * time (time) datetime64[ns] 1979-01-16T11:30:00\n", " * latitude (latitude) float32 37.12 39.12 41.12\n", " * longitude (longitude) float32 238.9 240.9 242.9 244.9 246.9 248.9 250.9\n", "Attributes:\n", " long_name: 2 metre temperature\n", " nameCDM: 2_metre_temperature_surface\n", " nameECMWF: 2 metre temperature\n", " product_type: analysis\n", " shortNameECMWF: 2t\n", " standard_name: air_temperature\n", " units: K
<xarray.DataArray 'air_temperature_at_2_metres' (latitude: 90, longitude: 180)>\n", "array([[228.23622, 228.2012 , 228.1636 , ..., 228.33353, 228.29826,\n", " 228.26756],\n", " [228.58359, 228.43468, 228.2972 , ..., 229.14705, 228.94029,\n", " 228.75558],\n", " [228.8029 , 228.41632, 228.07332, ..., 230.17851, 229.69302,\n", " 229.23451],\n", " ...,\n", " [260.34787, 260.42014, 260.47144, ..., 260.13943, 260.22015,\n", " 260.28165],\n", " [259.83597, 259.8577 , 259.88016, ..., 259.74002, 259.76907,\n", " 259.80435],\n", " [259.41345, 259.4211 , 259.42905, ..., 259.3969 , 259.4033 ,\n", " 259.40765]], dtype=float32)\n", "Coordinates:\n", " * latitude (latitude) float32 -88.88 -86.88 -84.88 ... 85.12 87.12 89.12\n", " * longitude (longitude) float32 0.875 2.875 4.875 6.875 ... 354.9 356.9 358.9
<xarray.DataArray 'air_temperature_at_2_metres' (latitude: 90, longitude: 180)>\n", "array([[228.23622, 228.2012 , 228.1636 , ..., 228.33353, 228.29826,\n", " 228.26756],\n", " [228.58359, 228.43468, 228.2972 , ..., 229.14705, 228.94029,\n", " 228.75558],\n", " [228.8029 , 228.41632, 228.07332, ..., 230.17851, 229.69302,\n", " 229.23451],\n", " ...,\n", " [260.34787, 260.42014, 260.47144, ..., 260.13943, 260.22015,\n", " 260.28165],\n", " [259.83597, 259.8577 , 259.88016, ..., 259.74002, 259.76907,\n", " 259.80435],\n", " [259.41345, 259.4211 , 259.42905, ..., 259.3969 , 259.4033 ,\n", " 259.40765]], dtype=float32)\n", "Coordinates:\n", " * latitude (latitude) float32 -88.88 -86.88 -84.88 ... 85.12 87.12 89.12\n", " * longitude (longitude) float32 0.875 2.875 4.875 6.875 ... 354.9 356.9 358.9
<xarray.Dataset>\n", "Dimensions: (\n", " year: 42,\n", " latitude: 90,\n", " longitude: 180)\n", "Coordinates:\n", " * latitude (latitude) float32 ...\n", " * longitude (longitude) float32 ...\n", " * year (year) int64 ...\n", "Data variables: (12/15)\n", " air_pressure_at_mean_sea_level (year, latitude, longitude) float32 ...\n", " air_temperature_at_2_metres (year, latitude, longitude) float32 ...\n", " air_temperature_at_2_metres_1hour_Maximum (year, latitude, longitude) float32 ...\n", " air_temperature_at_2_metres_1hour_Minimum (year, latitude, longitude) float32 ...\n", " dew_point_temperature_at_2_metres (year, latitude, longitude) float32 ...\n", " eastward_wind_at_100_metres (year, latitude, longitude) float32 ...\n", " ... ...\n", " northward_wind_at_100_metres (year, latitude, longitude) float32 ...\n", " northward_wind_at_10_metres (year, latitude, longitude) float32 ...\n", " precipitation_amount_1hour_Accumulation (year, latitude, longitude) float32 ...\n", " sea_surface_temperature (year, latitude, longitude) float32 ...\n", " snow_density (year, latitude, longitude) float32 ...\n", " surface_air_pressure (year, latitude, longitude) float32 ...\n", "Attributes:\n", " institution: ECMWF\n", " source: Reanalysis\n", " title: ERA5 forecasts