Caching
Caching in ODAP framework should follow the same rules as outside the framework
- Cache only the data that will be used multiple times: Caching all of the data in a Spark job can be inefficient, as it may use up all of the available memory and cause performance issues. Instead, it is best to only cache the data that will be used multiple times in the job.
- Use the right storage level: Spark provides several storage levels for caching data, including MEMORY_ONLY, MEMORY_AND_DISK, and OFF_HEAP. It is important to choose the right storage level based on the amount of data being cached and the available memory.
- Consider the eviction policy: When the memory used for caching exceeds the available memory, Spark will evict some of the cached data to make room for new data. It is important to choose the right eviction policy based on the usage patterns of the data.
- Use the
cacheTable
or persist
method: Spark provides the cacheTable
and persist
methods for caching data in memory. The cacheTable
method is specific to Spark SQL and is recommended for caching data that will be used in SQL queries, while the persist
method can be used for any type of data.
- Monitor cache usage: It is important to monitor the usage of the cache to ensure that it is being used efficiently and that there are no performance issues. Spark provides several metrics for monitoring cache usage, including the amount of data stored in the cache and the eviction rate.
For more information read this article
Proper partitioning
When using time windowed features it is very important to have the input tables partitioned by the time column (DateType or TimestampType) which is used to filter the data to the maximum time window.
For example using a dataset with the schema of client_id, event_date, amount
and the time windows of 14 days, 30 days and 90 days,
it is recommended to partition the table by event_date
and use
.filter(f.col("event_date").between(f.col("timestamp") - f.expr("interval 90 days"), f.col("timestamp"))
when loading the data.
More information and examples can be found in the time windowed features section.