This is my original PySpark DataFrame:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 1| 2|
| 1| 2| 2|
| 1| 3| 2|
| 1| 2| 1|
| 2| 1| 2|
| 2| 3| 2|
| 2| 2| 1|
| 3| 1| 2|
| 3| 3| 2|
| 3| 2| 1|
+----+----+----+
After sorting df by col2 and selecting all three columns into test:
df = df.sort('col2')
test = df.select('col1','col2','col3')
test.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 3| 1| 2|
| 2| 1| 2|
| 1| 1| 2|
| 1| 2| 1|
| 3| 2| 1|
| 1| 2| 2|
| 2| 2| 1|
| 1| 3| 2|
| 3| 3| 2|
| 2| 3| 2|
+----+----+----+
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 2| 1| 2|
| 3| 1| 2|
| 1| 1| 2|
| 1| 2| 2|
| 3| 2| 1|
| 1| 2| 1|
| 2| 2| 1|
| 3| 3| 2|
| 2| 3| 2|
| 1| 3| 2|
+----+----+----+
The row order of test is different from that of df, even though test was selected from the already sorted df. I don't know what happened. Can someone help me understand?
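One observation that may help narrow this down: the two outputs seem to differ only among rows that share the same col2 value. The sketch below is plain Python, not Spark. It copies the rows from the two outputs above and checks that both are valid orderings by col2, and that they become identical once col1 and col3 are used as tie-breakers. The helper names `is_sorted_by_col2` and `total_order` are made up for this illustration.

```python
# Rows copied from the two outputs above, as (col1, col2, col3) tuples.
test_rows = [(3, 1, 2), (2, 1, 2), (1, 1, 2), (1, 2, 1), (3, 2, 1),
             (1, 2, 2), (2, 2, 1), (1, 3, 2), (3, 3, 2), (2, 3, 2)]
df_rows = [(2, 1, 2), (3, 1, 2), (1, 1, 2), (1, 2, 2), (3, 2, 1),
           (1, 2, 1), (2, 2, 1), (3, 3, 2), (2, 3, 2), (1, 3, 2)]


def is_sorted_by_col2(rows):
    """Return True if col2 (index 1) never decreases from row to row."""
    keys = [row[1] for row in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def total_order(rows):
    """Sort by col2, then col1, then col3, so no two distinct rows tie."""
    return sorted(rows, key=lambda row: (row[1], row[0], row[2]))


# Both outputs contain exactly the same rows.
same_rows = sorted(test_rows) == sorted(df_rows)
# Both outputs respect the requested ordering on col2.
both_sorted = is_sorted_by_col2(test_rows) and is_sorted_by_col2(df_rows)
# With tie-breakers added, the two orderings coincide.
same_after_tiebreak = total_order(test_rows) == total_order(df_rows)

print(same_rows, both_sorted, same_after_tiebreak)
```

If that reading is right, the likely cause is that sorting on col2 alone leaves the order of tied rows unspecified, and each show() call re-evaluates the sort, so ties can come out differently each time. Under that assumption, the analogous change in PySpark would be to sort on enough columns to break every tie, for example df.sort('col2', 'col1', 'col3').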
question from:
https://stackoverflow.com/questions/65915215/pyspark-df-select-is-disordered-after-df-sort