Big Data Arrays with Python

Kipling Crossing
4 min read · Jul 26, 2021
source: https://speakerdeck.com/rabernat/zarr-ogc-2020

Big Data

Numpy arrays are great! They provide an intuitive, high-level interface for analysing and manipulating multi-dimensional data. They are also very efficient at arithmetic and linear-algebraic tasks, making them the perfect tool for working with structured, grid-like data.

To provide this performance, numpy arrays are held entirely in memory, so there is a limit to how much data a single array can hold. Often this isn’t a problem and we can enjoy the plethora of methods that the numpy library provides for efficient computation. For some tasks, though, the data is too large to fit into memory, and we have a Big Data problem. We could throw more memory at the problem, but in most cases it is impossible to foresee the extent of the data we might be working with.

numpy MemoryError example
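A quick way to reproduce this yourself is to allocate an array larger than your machine’s RAM; the shape below is arbitrary and just needs to be big enough:

```python
import numpy as np

# Roughly 320 GB of 64-bit floats: far more than a typical machine's RAM,
# so numpy raises a MemoryError before any work can be done.
big = np.ones((200_000, 200_000))
```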

Chunked Arrays

In this article, I will introduce the concept of chunked arrays, which deal with this Big Data problem. I’ll also include some examples so you can follow along.

Chunked arrays break a large array into smaller, manageable-sized numpy arrays called chunks. These chunks are stored on disk and referenced via an index, where each index is the chunk’s relative coordinates within the overall array. Chunking allows our memory to hold and process the data in smaller, bite-sized arrays.

Zarr

Zarr logo

Zarr is a lightweight library for managing chunked arrays. It allows the creation of arrays with a defined chunk size. These arrays are written to disk, and their data can be read and updated on the fly. The following example demonstrates the numpy-like interface for reading and writing data to persisted arrays:
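Here is a minimal sketch using zarr.open with a local directory store; the path, shape and chunk size are illustrative:

```python
import numpy as np
import zarr

# Create a persisted, chunked array on disk ('example.zarr' is just a placeholder path).
z = zarr.open(
    'example.zarr',
    mode='w',
    shape=(10_000, 10_000),
    chunks=(1_000, 1_000),
    dtype='f8',
)

# Write with familiar numpy slicing syntax; only the affected chunks are touched on disk.
z[0, :] = np.arange(10_000)

# Read a slice back; only the chunks covering it are loaded into memory.
print(z[0, :5])   # [0. 1. 2. 3. 4.]
print(z.chunks)   # (1000, 1000)
```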

The only limit to the size of Zarr arrays is the storage hardware, and with support for cloud storage, the size is practically limitless.

Rechunker

rechunking visualization. source: https://speakerdeck.com/rabernat/rechunker-esip-summer-2020

Reading data from a chunked array can be very efficient if done right. If you read (or slice) a section of the array that falls within one chunk, loading the data into memory will be quick. Reading across multiple chunks, however, can be considerably slower. You also don’t want to make your chunks too big; otherwise, not only may performance suffer, but the data may not fit into memory at all! Your chunks therefore need to be just the right size for how you use the array. There may be cases where you can’t choose the initial chunk size of the array, and this is where the rechunker package comes in handy. The following is an example of changing the chunks of a 2D array from (1000, 1000) to (500, 2000):
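A minimal sketch using rechunker’s rechunk function; the store paths and memory budget below are placeholders, and the source array is assumed to already exist on disk with (1000, 1000) chunks:

```python
import zarr
from rechunker import rechunk

# Source: a Zarr array persisted with (1000, 1000) chunks
# ('source.zarr' is an illustrative path).
source = zarr.open('source.zarr', mode='r')

plan = rechunk(
    source,
    target_chunks=(500, 2000),   # the new chunk shape we want
    max_mem='1GB',               # memory budget for the operation
    target_store='target.zarr',  # where the rechunked copy is written
    temp_store='temp.zarr',      # scratch space used during the copy
)

# The plan is lazy; execute() performs the actual rechunking.
plan.execute()
```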

Rechunker is a powerful and efficient way of rechunking chunked arrays.

Dask

Dask logo

Dask is a powerful library for parallel computing. It provides an intuitive and flexible interface for processing data on distributed systems, along with a range of high-level interfaces for reading, writing and manipulating data. In particular, Dask arrays offer an experience very close to numpy’s and include many of the same built-in methods. Further to this, like Zarr arrays, Dask arrays are chunked, allowing processing to be done chunk by chunk. This makes memory problems far less likely; the worst-case scenario is that processing takes too long on your local machine, at which point you can throw some parallel processors at the problem.

The following example demonstrates the numpy-like interface that Dask arrays provide:
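A short sketch with dask.array; the shapes, chunk sizes and the 'example.zarr' path are illustrative:

```python
import dask.array as da

# A large, chunked random array; nothing is computed or held in memory yet.
x = da.random.random((20_000, 20_000), chunks=(1_000, 1_000))

# Familiar numpy-style operations build a lazy task graph, chunk by chunk.
y = (x + x.T).mean(axis=0)

# compute() triggers the work, scheduling one task per chunk,
# locally or on a distributed cluster.
print(y.compute())

# Dask can also wrap an existing Zarr array directly.
z = da.from_zarr('example.zarr')
print(z.sum().compute())
```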

Conclusion

Without having to learn a new Python interface for dealing with arrays, this handful of tools can help us solve our big data problems when analysing and manipulating arrays in Python.


Kipling Crossing

I do many things including: Open-Source, Geo-spatial Data Science, Scientific utility apps, Micro-Python and Writing