Data Dependencies – Shameful Blog

It's important to consider the short- and long-term health of the data you're working with. In machine learning environments, be sure to consider the following:

Necessity: Does the data actually improve our models, or does the complexity it adds outweigh the improvement it offers?

Reliability: What happens if a data source is unavailable, and how can you identify data access failures?

Correlations: Are certain inputs so entangled that we need to do additional work to separate them?

Versioning: Do the data sources ever change? If so, what is the effect of that change, and how often is change likely to occur?

Feedback loops: Can inputs be impacted by outputs? For example, a university ranking system that considers the percentage of applicants who were accepted could create a loop: the highest-ranked institutions are, because of their rankings, considered even more prestigious, which attracts more applicants, lowers the acceptance rate, and produces an even higher score.
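To make the feedback loop concrete, here is a minimal toy simulation of the university ranking example. Everything in it is an assumption made for illustration: the function name `simulate_ranking_feedback`, the fixed number of seats, the starting applicant count, the scoring rule (score = 1 - acceptance rate), and the `prestige_gain` factor that turns a score into extra applicants. It is a sketch of the dynamic, not a model of any real ranking system.

```python
def simulate_ranking_feedback(seats=1000, applicants=5000,
                              prestige_gain=0.1, rounds=5):
    """Toy model of a ranking feedback loop (all parameters hypothetical).

    Each round, the score is derived from the acceptance rate (an input),
    and the score then changes next round's applicant count, which in turn
    changes the acceptance rate.
    """
    history = []
    for _ in range(rounds):
        acceptance_rate = seats / applicants
        # Assumed scoring rule: more selective means a higher score.
        score = 1.0 - acceptance_rate
        history.append((applicants, score))
        # The output (score) feeds back into the input (applicant count).
        applicants = int(applicants * (1 + prestige_gain * score))
    return history


if __name__ == "__main__":
    for year, (applicants, score) in enumerate(simulate_ranking_feedback(), 1):
        print(f"year {year}: applicants={applicants}, score={score:.4f}")
```

Under these assumptions, both the applicant count and the score rise every round even though nothing about the institution itself has changed, which is the kind of self-reinforcing signal worth checking for before trusting an input.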
