All of the following classes and functions calculate features over multiple time windows, such as 14d, 30d, and 90d. The calculation is done efficiently in a single pass over one groupBy, making it almost N times faster than calculating each window separately and joining the results.
The helper functions are declarative: the user only specifies what the result of the operation should be, and the calculation itself is handled in the background.
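To illustrate the single-pass idea independently of Spark, here is a minimal pure-Python sketch (the data and variable names are made up for illustration, not part of the library): each row contributes to every window whose interval contains it, so one pass over the data fills all windows at once instead of filtering, aggregating, and joining once per window.

```python
from datetime import date

# Hypothetical visit log: (user_id, visit_date) pairs.
visits = [
    ("u1", date(2023, 6, 28)),
    ("u1", date(2023, 6, 10)),
    ("u1", date(2023, 4, 15)),
    ("u2", date(2023, 6, 25)),
]

run_date = date(2023, 6, 30)
windows = {"14d": 14, "30d": 30, "90d": 90}

# One pass over the rows updates the counts for every window at once.
counts = {}
for user, visit_date in visits:
    age = (run_date - visit_date).days
    for name, days in windows.items():
        if 0 <= age <= days:
            key = (user, name)
            counts[key] = counts.get(key, 0) + 1

print(counts[("u1", "30d")])  # visits by u1 in the last 30 days
```

The same shape carries over to Spark, where each window becomes a conditional aggregation inside a single groupBy.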
All examples assume these imports:

```python
import daipe as dp
from typing import Callable, List
from pyspark.sql import Column, DataFrame, functions as f
from odap.feature_factory import time_windows as tw
```
WindowedDataFrame(self, df: DataFrame, time_column: str, time_windows: List[str]) -> WindowedDataFrame
A DataFrame wrapper that allows for time-windowed calculations.
- `df : DataFrame` — input DataFrame
- `time_column : str` — name of a Date or Timestamp column in `df`, which is subtracted from the `run_date` to create the time window interval
- `time_windows : List[str]` — list of time windows matching `[0-9]+[dhw]`; suffixes: d = days, h = hours, w = weeks

```python
wdf = tw.WindowedDataFrame(
    df=spark.read.table("odap_digi_sdm_l2.web_visits"),
    time_column="visit_timestamp",
    time_windows=["14d", "30d", "90d"],
)
```
time_windowed(self, group_keys: List[str], agg_columns_function: Callable[[str], List[WindowedColumn]] = lambda x: list(), non_agg_columns_function: Callable[[str], List[Column]] = lambda x: list(), extra_group_keys: List[str] = [], unnest_structs: bool = False) -> WindowedDataFrame
Returns a new WindowedDataFrame with the calculated aggregated and non-aggregated columns.
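To show the shape of the callable-based API (this is a toy sketch of the pattern, not the odap implementation — the helper `collect_windowed_names` is hypothetical): `agg_columns_function` is called once per time window string and returns what should be computed for that window, so a single declarative function covers all windows.

```python
from typing import Callable, List

def collect_windowed_names(
    time_windows: List[str],
    agg_columns_function: Callable[[str], List[str]],
) -> List[str]:
    """Toy stand-in: gather the output names declared for every time window."""
    names = []
    for window in time_windows:
        # The user-supplied callable declares the per-window results.
        names.extend(agg_columns_function(window))
    return names

# The user only declares what each window's result should look like;
# iterating over the windows happens in the background.
columns = collect_windowed_names(
    ["14d", "30d", "90d"],
    lambda window: [f"web_visits_count_{window}"],
)
print(columns)
```

In the real API the callable returns Spark aggregation columns (e.g. built with `f.sum`, `f.count`) rather than plain names, but the declarative contract is the same.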