All of the following classes and functions are designed to calculate features in multiple time windows such as `14d`, `30d` and `90d`. The calculation is done efficiently over one group by, making it almost N times faster than calculating each window separately and joining the results.
The helper functions are designed in a declarative way: the user only specifies what the result of the operation should be, and the calculation is done in the background.
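Conceptually, computing several windows in one pass resembles conditional aggregation in plain PySpark. The sketch below (with an assumed `run_date`, key column and timestamp column) only illustrates the idea; it is not the library's actual implementation:

```python
from datetime import date

from pyspark.sql import functions as f

run_date = date.today()  # assumed reference date

# One groupBy covers all windows: each aggregate only counts rows
# whose timestamp falls inside its window.
windowed_counts = df.groupBy("customer_id").agg(
    *[
        f.sum(
            f.when(
                f.col("visit_timestamp")
                >= f.lit(run_date) - f.expr(f"INTERVAL {days} DAYS"),
                1,
            ).otherwise(0)
        ).alias(f"visits_in_last_{days}d")
        for days in (14, 30, 90)
    ]
)
```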
All examples assume these imports:
```python
import daipe as dp
from typing import List
from pyspark.sql import Column, DataFrame, functions as f
from odap.feature_factory import time_windows as tw
```
`WindowedDataFrame(self, df: DataFrame, time_column: str, time_windows: List[str]) -> WindowedDataFrame`

DataFrame which allows for time-windowed calculations.
- `df`: DataFrame, input DataFrame
- `time_column`: str, name of a `Date` or `Timestamp` column in `df`, which is subtracted from the `run_date` to create the time window interval
- `time_windows`: List[str], list of time windows in the format `[0-9]+[dhw]`, where the suffix `d` = days, `h` = hours, `w` = weeks

```python
wdf = tw.WindowedDataFrame(
    df=spark.read.table("odap_digi_sdm_l2.web_visits"),
    time_column="visit_timestamp",
    time_windows=["14d", "30d", "90d"],
)
```
`time_windowed(self, group_keys: List[str], agg_columns_function: Callable[[str], List[WindowedColumn]] = lambda x: list(), non_agg_columns_function: Callable[[str], List[Column]] = lambda x: list(), extra_group_keys: List[str] = [], unnest_structs: bool = False) -> WindowedDataFrame`

Returns a new WindowedDataFrame with calculated aggregated and non-aggregated columns.
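A minimal sketch of a `time_windowed` call follows; the `tw.sum_windowed` helper, the `customer_id` group key and the feature name are assumptions for illustration, not part of the signature documented above:

```python
def agg_features(time_window: str) -> List[tw.WindowedColumn]:
    # Assumed helper: called once per configured window ("14d", "30d", "90d"),
    # producing one column per window, aggregated only over rows in that window.
    return [
        tw.sum_windowed(
            f"web_visits_count_in_last_{time_window}",
            f.lit(1),
        ),
    ]

features_df = wdf.time_windowed(
    group_keys=["customer_id"],
    agg_columns_function=agg_features,
)
```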