All of the following classes and functions are designed to calculate features over multiple time windows, such as 14d, 30d, and 90d. The calculation is done efficiently in a single group by, which makes it almost N times faster than calculating each window separately and joining the results.

The helper functions are designed in a declarative way: the user only specifies what the result of the operation should be, and the calculation is performed in the background.

All examples assume these imports:

import daipe as dp

from typing import List
from pyspark.sql import Column, DataFrame, functions as f

from odap.feature_factory import time_windows as tw
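
To illustrate the single-pass idea described above, here is a minimal hand-written sketch (not the library's internal implementation): each time window becomes a conditional aggregation inside one group by, instead of one aggregation plus a join per window. The customer_id grouping key, the reference date, and the output column names are assumptions made only for this illustration.

# Minimal sketch of the single-pass pattern, assuming a customer_id key
# and a fixed reference date; for illustration only.
reference_date = f.lit("2023-01-01").cast("date")

df = spark.read.table("odap_digi_sdm_l2.web_visits")

single_pass_df = df.groupBy("customer_id").agg(
    *[
        f.count(
            # Count only rows that fall inside the given window.
            f.when(f.col("visit_timestamp") >= f.date_sub(reference_date, days), True)
        ).alias(f"web_visits_count_{days}d")
        for days in (14, 30, 90)
    ]
)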

WindowedDataFrame

WindowedDataFrame(self, df: DataFrame, time_column: str, time_windows: List[str]) -> WindowedDataFrame

A DataFrame wrapper that allows for time-windowed calculations.

wdf = tw.WindowedDataFrame(
    df=spark.read.table("odap_digi_sdm_l2.web_visits"),
    time_column="visit_timestamp",
    time_windows=["14d", "30d", "90d"],
)

Methods

time_windowed

time_windowed(self, group_keys: List[str], agg_columns_function: Callable[[str], List[WindowedColumn]] = lambda x: list(), non_agg_columns_function: Callable[[str], List[Column]] = lambda x: list(), extra_group_keys: List[str] = [], unnest_structs: bool = False) -> WindowedDataFrame

Returns a new WindowedDataFrame with the calculated aggregated and non-aggregated columns.
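
A usage sketch based only on the signature above: group_keys are the aggregation keys, agg_columns_function receives each time window string (e.g. "14d") and returns WindowedColumn aggregations, and non_agg_columns_function returns plain Columns computed from the already aggregated results. The customer_id key, the feature names, and the tw.count_windowed-style helper used to build a WindowedColumn are assumptions made for illustration; the actual WindowedColumn factories are documented elsewhere in this module.

def agg_features(time_window: str) -> List[tw.WindowedColumn]:
    # Assumed WindowedColumn factory; only a sketch of the intended usage.
    return [
        tw.count_windowed(f"web_visits_count_{time_window}", f.lit(1)),
    ]


def non_agg_features(time_window: str) -> List[Column]:
    # Plain Columns derived from the already aggregated windowed columns.
    return [
        (f.col(f"web_visits_count_{time_window}") > 0).alias(
            f"visited_web_{time_window}"
        )
    ]


features_wdf = wdf.time_windowed(
    group_keys=["customer_id"],  # assumed grouping key
    agg_columns_function=agg_features,
    non_agg_columns_function=non_agg_features,
)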