Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Train_test_split produces unexpected sample size due to rounding

I'm building a logistic regression model using sklearn. I'm essentially trying to simulate running the model each day to make a decision for the following day.

To get train_test_split to take the set of choices available on the "test" day, I just set the split to:

splt = 1/observ_days

The size of the observation set is the number of observed days times the number of available choices per day. The number of choices varies slightly with availability, but in the problem case it is 45. So the total number of observations is 45 * observ_days, and the test set should contain 45.

The challenge is that train_test_split always rounds up, so an ordinary floating-point error can produce an unexpected split. Specifically, in my case the number of observed days is 744, and 1/744 = 0.0013440860215053765. The size of the total data at that moment is 33480. Mathematically, 33480 * the split = 45, as it should. But Python comes up with 45.00000000000001, so train_test_split gives me 46 test observations.
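The arithmetic above can be reproduced without scikit-learn. This is a minimal sketch that assumes train_test_split derives the test count by taking the ceiling of test_size * n_samples, which matches the rounding-up behavior described here; the variable names are illustrative only.

```python
import math

observ_days = 744        # number of observed days in the problem case
choices_per_day = 45     # available choices per day in the problem case
n_samples = observ_days * choices_per_day  # 33480 total observations

split = 1 / observ_days              # float proportion used as the split
raw_test_size = n_samples * split    # slightly above 45 due to float error

# Assumed behavior: a float split is rounded up to a whole number of samples.
n_test_ceil = math.ceil(raw_test_size)

# Rounding to the nearest integer instead absorbs the tiny float error.
n_test_round = round(raw_test_size)

print(raw_test_size, n_test_ceil, n_test_round)
```

The product lands just above 45, so the ceiling jumps to 46, while rounding to the nearest integer stays at 45.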

That's a problem because in my case the 46th observation is actually from another day. Is there a way to force train_test_split to round down? Or to specify the exact size of the train/test set manually?

question from:https://stackoverflow.com/questions/65900883/train-test-split-produces-unexpected-sample-size-due-to-rounding


1 Reply


If you check the scikit-learn documentation for train_test_split, you'll notice that train_size and test_size can each be given either as a float (the proportion of the whole dataset to use) or as an int (the exact number of data points to include).

In your case, you could just specify test_size=45 to always take exactly 45 data points for the test set.
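A rough sketch of that suggestion, using made-up data in place of the real features. The arrays X and y are hypothetical, and shuffle=False is an assumption on my part: the question implies the rows are in day order and the test set should be the last day's block.

```python
import numpy as np
from sklearn.model_selection import train_test_split

observ_days = 744
choices_per_day = 45
n_samples = observ_days * choices_per_day  # 33480

# Hypothetical stand-in data: one feature column, and the day index as the label
# so it is easy to see which day each test row came from.
X = np.arange(n_samples).reshape(-1, 1)
y = np.repeat(np.arange(observ_days), choices_per_day)

# An int test_size asks for exactly that many samples, so no float rounding occurs.
# shuffle=False (assumed) keeps row order, so the test set is the final 45 rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=choices_per_day, shuffle=False
)

print(len(X_train), len(X_test))
```

If the number of available choices changes from day to day, the same call should work with test_size set to that day's count instead of a fixed 45.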



...