I'm building a logistic-regression model with sklearn. I'm essentially trying to simulate running the model each day to make a decision for the following day.
To get train_test_split to take the set of choices available on the "test" day, I just set the split to:

splt = 1 / observ_days
The size of the observation set is observed days * available choices. The number of available choices varies slightly depending on availability, but in the problem case it is 45. So the total observations = 45 * observ_days, and the test set should be 45.
The challenge is that train_test_split always rounds the test-set size up, so a common floating-point issue can produce an unexpected split. Specifically, in my case the observed days is 744, and 1/744 = 0.0013440860215053765. The size of the total data at that moment is 33480. In exact math, 33480 * the split = 45, as it should. But Python comes up with 45.00000000000001, so train_test_split gives me 46 test observations.
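The rounding behavior described above can be reproduced with plain arithmetic (no sklearn needed); the numbers here are the ones from my case:

```python
import math

observ_days = 744
n_samples = 45 * observ_days      # 33480 total observations
split = 1 / observ_days           # 0.0013440860215053765

# In exact math this is 45, but the float product is slightly above it,
# and sklearn computes the test-set size with ceil(), giving 46.
product = split * n_samples
print(product)
print(math.ceil(product))
```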
That's a problem because in my case the 46th observation is actually from another day. Is there a way to force train_test_split to round down? Or to specify the exact size of the train/test sets manually?
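For what it's worth, here is a sketch of two workarounds I'm considering (dummy data; choices_per_day = 45 is taken from the numbers above). An integer test_size is used by train_test_split as an absolute count, so no float rounding happens; alternatively, since the "test day" is the last block of rows, plain slicing would also work:

```python
import numpy as np
from sklearn.model_selection import train_test_split

choices_per_day = 45                         # choices available per day
n_samples = 33480                            # 45 choices * 744 days
X = np.arange(n_samples * 3).reshape(n_samples, 3)  # dummy features
y = np.zeros(n_samples)                             # dummy labels

# Workaround 1: an integer test_size is taken as an exact count,
# so exactly 45 rows land in the test set. shuffle=False keeps the
# day ordering intact so the test block really is the last day.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=choices_per_day, shuffle=False)
print(len(X_te))

# Workaround 2: slice manually; the last day's block becomes the test set.
X_tr2, X_te2 = X[:-choices_per_day], X[-choices_per_day:]
print(len(X_te2))
```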
question from:
https://stackoverflow.com/questions/65900883/train-test-split-produces-unexpected-sample-size-due-to-rounding