You can use Soda data checks to test the quality of the data in your features DataFrame.
Checks are written in SodaCL (Soda Checks Language).
Documentation can be found here: https://docs.soda.io/soda-cl/soda-cl-overview.html
In the documentation, checks start with a definition line such as "checks for dim_product:", where dim_product is the name of the DataFrame we want to check. Here you don't have to define it; the odap framework does it automatically.
If you run orchestration or a dry run on more than one feature, the features are first joined into one DataFrame and the checks are then run on it.
In the provided documentation, checks are in YAML format. To apply them to your features, they need to be rewritten in Python syntax (JSON-like lists and dictionaries).
<aside> 💡 An online YAML → Python converter can be used for this, with some success.
</aside>
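As an alternative to an online converter, the conversion can be sketched locally. This is a minimal sketch, assuming the third-party PyYAML package is installed; the helper name yaml_checks_to_python is hypothetical and is not part of odap or Soda. It parses the YAML text and keeps only the list under each "checks for ..." key, since the DataFrame name is supplied by the framework.

```python
import yaml  # PyYAML, a third-party package (assumption: pip install pyyaml)

SODA_YAML = """
checks for my_dataframe:
  - invalid_count(column_name) = 0:
      valid values: ['hodnota1', 'hodnota2', 'hodnota3']
  - missing_percent(web_visit_count) < 50%
"""


def yaml_checks_to_python(yaml_text):
    """Hypothetical helper: collect the checks from every 'checks for <name>:' block."""
    parsed = yaml.safe_load(yaml_text) or {}
    checks = []
    for key, block in parsed.items():
        # The DataFrame name in the key is dropped on purpose;
        # only the list of checks is kept.
        if str(key).startswith("checks for "):
            checks.extend(block or [])
    return checks


dq_checks = yaml_checks_to_python(SODA_YAML)
print(dq_checks)
```

The printed list can be pasted into a notebook as the dq_checks value. It is worth comparing the result with the original YAML by eye before relying on it.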
Checks in YAML
checks for my_dataframe:
  - invalid_count(column_name) = 0:
      valid values: ['hodnota1', 'hodnota2', 'hodnota3']
  - missing_percent(web_visit_count) < 50%
Checks in Python
dq_checks = [
    {
        "invalid_count(column_name) = 0": {
            "valid values": ["hodnota1", "hodnota2", "hodnota3"]
        }
    },
    "missing_percent(web_visit_count) < 50%",
]
You can define checks inside a feature notebook by adding a dq_checks list. Write the checks you need in that list using Python syntax; the available checks are described in the SodaCL documentation.
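Because the list is written by hand, a quick shape check before running it can catch conversion slips. The following is a minimal sketch, not part of odap or Soda: validate_dq_checks is a hypothetical helper, and it assumes only the two entry shapes shown in the example above (a plain string, or a single-key dictionary mapping one check expression to a dictionary of options).

```python
def validate_dq_checks(dq_checks):
    """Hypothetical helper: return a list of shape problems (empty if none)."""
    if not isinstance(dq_checks, list):
        return ["dq_checks must be a list"]
    problems = []
    for i, check in enumerate(dq_checks):
        if isinstance(check, str):
            continue  # plain check, e.g. "missing_percent(col) < 50%"
        if isinstance(check, dict) and len(check) == 1:
            name, config = next(iter(check.items()))
            if not isinstance(name, str) or not isinstance(config, dict):
                problems.append(
                    f"entry {i}: expected one check expression mapped to a dict of options"
                )
            continue
        problems.append(f"entry {i}: expected a string or a single-key dict")
    return problems


dq_checks = [
    {
        "invalid_count(column_name) = 0": {
            "valid values": ["hodnota1", "hodnota2", "hodnota3"]
        }
    },
    "missing_percent(web_visit_count) < 50%",
]

print(validate_dq_checks(dq_checks))
```

This only checks the structure of the list; whether each check expression is valid SodaCL is still decided by Soda when the checks run.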