Streaming GPU DataFrames (cudf)

The streamz.dataframe module provides a DataFrame-like interface for streaming data, as described in the dataframes documentation. It supports dataframe-like libraries such as pandas and cudf. This documentation is specific to streaming GPU dataframes using cudf.

The example from the dataframes documentation is rewritten below for cudf dataframes; the only change is replacing the pandas module with cudf:

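import cudf
from streamz import Stream
from streamz.dataframe import DataFrame

stream = Stream()  # source stream; each emitted element is one cudf DataFrame batch
example = cudf.DataFrame({'name': [], 'amount': []})  # empty frame describing the columns
sdf = DataFrame(stream, example=example)

# running sum of 'amount' over the rows where name is 'Alice'
sdf[sdf.name == 'Alice'].amount.sum()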
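
To make the result of the expression above concrete, here is a minimal plain-Python sketch of what a filtered streaming sum computes: one cumulative total per incoming batch. It does not use streamz or cudf and says nothing about how either library is implemented; the function name `running_filtered_sum` and the sample batches are hypothetical, chosen only for illustration.

```python
def running_filtered_sum(batches, name="Alice"):
    """Yield the cumulative 'amount' total for rows matching `name`, once per batch."""
    total = 0
    for batch in batches:
        # Keep only the matching rows of this batch, then fold them into the total.
        total += sum(row["amount"] for row in batch if row["name"] == name)
        yield total


# Hypothetical sample data: three batches arriving one after another.
batches = [
    [{"name": "Alice", "amount": 100}, {"name": "Bob", "amount": 200}],
    [{"name": "Alice", "amount": 50}],
    [{"name": "Bob", "amount": 75}],
]

totals = list(running_filtered_sum(batches))
print(totals)
```

A batch with no matching rows still produces a value here; the total is simply unchanged from the previous batch.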

Supported Operations

Streaming cudf dataframes support the following classes of operations:

  • Elementwise operations like df.x + 1
  • Filtering like df[df.name == 'Alice']
  • Column addition like df['z'] = df.x + df.y
  • Reductions like df.amount.mean()
  • Windowed aggregations (fixed length) like df.window(n=100).amount.sum()
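
As an illustration of the last item, a fixed-length window aggregation such as df.window(n=100).amount.sum() can be thought of as a sum over the most recent n rows seen so far, re-evaluated as each batch arrives. The sketch below shows that idea in plain Python, assuming a window of n=3 values; it is not streamz or cudf code, and the function name `fixed_window_sums` is hypothetical.

```python
from collections import deque


def fixed_window_sums(batches, n):
    """Yield, after each batch, the sum of the most recent `n` values seen so far."""
    window = deque(maxlen=n)  # older values fall out once the window is full
    for batch in batches:
        window.extend(batch)
        yield sum(window)


# Hypothetical sample data: three batches of 'amount' values, window of 3 rows.
sums = list(fixed_window_sums([[1, 2], [3], [4, 5]], n=3))
print(sums)
```

Until n rows have arrived, the window simply covers everything seen so far.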

The following operations are not yet supported with cudf (as of version 0.8):

  • Groupby-aggregations like df.groupby(df.name).amount.mean()
  • Windowed aggregations (index valued) like df.window(value='2h').amount.sum()
  • Windowed groupby aggregations like df.window(value='2h').groupby('name').amount.sum()

Fixed-length window-based aggregations with cudf work as explained in the dataframes documentation. Support for groupby operations is expected to be added in the future.