All of the following classes and functions are designed to calculate features in multiple time windows such as `14d`, `30d` and `90d`. The calculation is done efficiently over one group by, making it almost N times faster than calculating each window separately and joining the results.
The helper functions are designed in a declarative way: the user only specifies what the result of the operation should be, and the calculation is done in the background.
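Conceptually, computing several windows in one pass resembles conditional aggregation in plain PySpark. The sketch below (with an assumed `run_date`, key column and timestamp column) only illustrates the idea; it is not the library's actual implementation:

```python
from datetime import date

from pyspark.sql import functions as f

run_date = date.today()  # assumed reference date

# One groupBy covers all windows: each aggregate only counts rows
# whose timestamp falls inside its window.
windowed_counts = df.groupBy("customer_id").agg(
    *[
        f.sum(
            f.when(
                f.col("visit_timestamp")
                >= f.lit(run_date) - f.expr(f"INTERVAL {days} DAYS"),
                1,
            ).otherwise(0)
        ).alias(f"visits_in_last_{days}d")
        for days in (14, 30, 90)
    ]
)
```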
All examples assume these imports:
```python
import daipe as dp
from typing import List
from pyspark.sql import Column, DataFrame, functions as f
from odap.feature_factory import time_windows as tw
```
`WindowedDataFrame(self, df: DataFrame, time_column: str, time_windows: List[str]) -> WindowedDataFrame`

DataFrame which allows for time-windowed calculations.
- `df`: DataFrame, input DataFrame
- `time_column`: str, name of a `Date` or `Timestamp` column in `df`, which is subtracted from the `run_date` to create the time window interval
- `time_windows`: List[str], list of time windows in the format `[0-9]+[dhw]`, where the suffix `d` = days, `h` = hours, `w` = weeks

```python
wdf = tw.WindowedDataFrame(
    df=spark.read.table("odap_digi_sdm_l2.web_visits"),
    time_column="visit_timestamp",
    time_windows=["14d", "30d", "90d"],
)
```
`time_windowed(self, group_keys: List[str], agg_columns_function: Callable[[str], List[WindowedColumn]] = lambda x: list(), non_agg_columns_function: Callable[[str], List[Column]] = lambda x: list(), extra_group_keys: List[str] = [], unnest_structs: bool = False) -> WindowedDataFrame`

Returns a new WindowedDataFrame with calculated aggregated and non-aggregated columns.
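A minimal sketch of a `time_windowed` call follows; the `tw.sum_windowed` helper, the `customer_id` group key and the feature name are assumptions for illustration, not part of the signature documented above:

```python
def agg_features(time_window: str) -> List[tw.WindowedColumn]:
    # Assumed helper: called once per configured window ("14d", "30d", "90d"),
    # producing one column per window, aggregated only over rows in that window.
    return [
        tw.sum_windowed(
            f"web_visits_count_in_last_{time_window}",
            f.lit(1),
        ),
    ]

features_df = wdf.time_windowed(
    group_keys=["customer_id"],
    agg_columns_function=agg_features,
)
```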