python - Train_test_split produces unexpected sample size due to rounding

Question

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm building a logistic regression model with sklearn. I'm essentially trying to simulate running the model each day to make a decision for the following day.

To get train_test_split to take the set of choices available on the "test" day, I just set the split to:

splt = 1/observ_days

The size of the observation set is observed days * available choices. The number of available choices changes slightly depending on availability, but in the problem case it is 45. So the total observations = 45 * observ_days, and the test set should contain 45.

The challenge is that train_test_split rounds a fractional test size up, so ordinary floating-point error can produce an unexpected split. In my case the number of observed days is 744, so splt = 1/744 = 0.0013440860215053765, and the total data size at that point is 33480. Mathematically, 33480 * splt = 45, as it should. But Python evaluates it to 45.00000000000001, so train_test_split gives me 46 test observations.
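The arithmetic above can be reproduced without scikit-learn. This is a minimal sketch that assumes the test count is the ceiling of n_samples * test_size, which matches the round-up behaviour described here. The names choices_per_day, n_samples, raw and n_test are illustrative and not part of the original code.

```python
import math

observ_days = 744
choices_per_day = 45                       # the problem case described above
n_samples = observ_days * choices_per_day  # 33480 rows in total

splt = 1 / observ_days   # fractional test size, as in the question
raw = n_samples * splt   # 45 in exact arithmetic, slightly more in floating point

# Assumption: a float test size is converted to a row count by rounding up.
n_test = math.ceil(raw)

print(repr(raw))  # a value just above 45
print(n_test)     # 46, one row more than intended
```

Because the product lands a hair above 45, any round-up rule pushes it to 46, which is where the extra row from the neighbouring day comes from.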

That's a problem because in my case the 46th observation actually belongs to another day. Is there a way to force train_test_split to round down? Or to specify the exact size of the train/test sets manually?

question from:https://stackoverflow.com/questions/65900883/train-test-split-produces-unexpected-sample-size-due-to-rounding

1 Reply

深蓝 · Answer 1 · 2021-10-06T19:15:13+0000

If you check the scikit-learn documentation for train_test_split, you'll see that train_size and test_size each accept either a float (the proportion of the whole dataset to use) or an int (the exact number of data points to include).

In your case you can simply pass test_size=45 to always get exactly 45 data points in the test set.
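A minimal sketch of that suggestion follows. The placeholder arrays X, y and day are hypothetical stand-ins for the real data, and the sketch assumes the rows are sorted by day and that shuffle=False is wanted so the test set is the final block of rows. Neither detail is stated in the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

observ_days = 744
choices_per_day = 45                       # the problem case from the question
n_samples = observ_days * choices_per_day  # 33480

# Hypothetical placeholder data, ordered by day with the oldest day first.
X = np.arange(n_samples).reshape(-1, 1)
y = np.zeros(n_samples, dtype=int)
day = np.repeat(np.arange(observ_days), choices_per_day)

# An int test_size is an exact row count, so no float rounding is involved.
# shuffle=False keeps row order, so the test set is the last 45 rows.
X_train, X_test, y_train, y_test, day_train, day_test = train_test_split(
    X, y, day, test_size=choices_per_day, shuffle=False
)

print(len(X_train), len(X_test))
print(np.unique(day_test))  # only the final day should appear
```

Since the number of available choices varies slightly from day to day, the count for the actual test day would need to be passed each time instead of a fixed 45.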
