Human behavior is imperfect

I was surprised to learn that many machine learning models are trained on perfect data (i.e., data in which the ground truth associates exactly one outcome with each input). Human choices are non-deterministic (i.e., not perfectly predictable), so shouldn’t ML models that aim to predict human behavior be trained on imperfect data?

In this open-source publication, my Lirio colleagues Andrew Starnes and Anton Dereventsov present a simulator they built to generate an arbitrary number of synthetic data points: imperfect data that may mimic human decision making better than perfect data does.

Abstract: We establish a non-deterministic model that predicts a user's food preferences from their demographic information. Our simulator is based on NHANES dataset and domain expert knowledge in the form of established behavioral studies. Our model can be used to generate an arbitrary amount of synthetic datapoints that are similar in distribution to the original dataset and align with behavioral science expectations. Such a simulator can be used in a variety of machine learning tasks and especially in applications requiring human behavior prediction.

https://arxiv.org/abs/2301.09454
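
To make the difference between perfect and imperfect data concrete, here is a minimal Python sketch. It is not the simulator from the paper: the demographic groups, food categories, probabilities, and function names are all hypothetical, invented only to illustrate the idea. A deterministic labeler always returns the same outcome for the same input, while a non-deterministic one draws the outcome from a probability distribution conditioned on the input.

```python
import random
from collections import Counter

# Hypothetical preference probabilities per demographic group.
# These numbers are invented for illustration; they are NOT from NHANES or the paper.
PREFERENCES = {
    "18-39": {"fruit": 0.20, "vegetables": 0.30, "snacks": 0.50},
    "40+": {"fruit": 0.45, "vegetables": 0.35, "snacks": 0.20},
}


def deterministic_label(group):
    """'Perfect' data: each input maps to exactly one outcome (the most likely one)."""
    probs = PREFERENCES[group]
    return max(probs, key=probs.get)


def sample_label(group, rng):
    """'Imperfect' data: the outcome is drawn from the group's distribution."""
    probs = PREFERENCES[group]
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


def simulate(n, rng):
    """Generate n synthetic (group, food) data points with sampled outcomes."""
    groups = list(PREFERENCES)
    return [(g, sample_label(g, rng)) for g in (rng.choice(groups) for _ in range(n))]


rng = random.Random(0)  # fixed seed so the run is reproducible
data = simulate(10_000, rng)

# Empirical outcome frequencies for one group; they should be close to the
# probabilities configured above, rather than a single repeated label.
counts = Counter(food for group, food in data if group == "18-39")
total = sum(counts.values())
for food in sorted(counts):
    print(food, round(counts[food] / total, 2))
```

In this toy version, two people from the same group can end up with different food preferences, which is the property that a dataset with exactly one outcome per input cannot express.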