Caching

Caching in the ODAP framework follows the same rules as Spark caching outside the framework:

  1. Cache only the data that will be used multiple times: Caching all of the data in a Spark job can be inefficient, as it may use up all of the available memory and cause performance issues. Instead, it is best to only cache the data that will be used multiple times in the job.
  2. Use the right storage level: Spark provides several storage levels for caching data, including MEMORY_ONLY, MEMORY_AND_DISK, and OFF_HEAP. It is important to choose the right storage level based on the amount of data being cached and the available memory.
  3. Consider eviction: When the memory used for caching exceeds the available memory, Spark evicts cached blocks in least-recently-used (LRU) order to make room for new data. Choose a storage level that matches how the data is reused, e.g. MEMORY_AND_DISK spills evicted blocks to disk instead of recomputing them.
  4. Use the cacheTable or persist method: Spark provides the spark.catalog.cacheTable and DataFrame.persist methods for caching data in memory. The cacheTable method is specific to Spark SQL and is recommended for caching tables that will be used in SQL queries, while persist (or its shorthand cache) can be used on any DataFrame.
  5. Monitor cache usage: It is important to monitor the cache to ensure that it is being used efficiently and that there are no performance issues. Spark exposes several metrics for this, including the amount of data stored in the cache and the eviction rate, visible for example in the Storage tab of the Spark UI.

For more information, read this article.

Proper partitioning

When using time windowed features, it is very important to have the input tables partitioned by the time column (DateType or TimestampType) that is used to filter the data down to the maximum time window.

For example, given a dataset with the schema client_id, event_date, amount and time windows of 14 days, 30 days, and 90 days, it is recommended to partition the table by event_date and filter to the longest window when loading the data:

.filter(f.col("event_date").between(f.col("timestamp") - f.expr("interval 90 days"), f.col("timestamp")))

More information and examples can be found in the time windowed features section.