I want to find all of the rows that are duplicated across the columns city, round_latitude, and round_longitude. So, if two rows share the same values in each of those columns, both of them should be returned.
I'm not exactly sure what is going on here: I'm certain that there are duplicates in the dataset, yet running In[38] raises no error, and the result shows the column names but no rows. What am I doing wrong, and how can I fix it?
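For reference, this is roughly the behaviour I'm expecting from duplicated, sketched on a made-up frame (not my real data; the downloads numbers are invented just for illustration):

import pandas as pd

# Tiny made-up frame: two rows share the same city / rounded coordinates
toy = pd.DataFrame({
    "city": ["Philadelphia", "Philadelphia", "Houston"],
    "round_latitude": [40.0, 40.0, 30.0],
    "round_longitude": [-75.0, -75.0, -95.0],
    "downloads": [3, 5, 1],
})

# keep=False flags every member of a duplicated group, not only the later occurrences
dupes = toy[toy.duplicated(subset=["city", "round_latitude", "round_longitude"], keep=False)]
print(dupes)  # I would expect both Philadelphia rows back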
If it helps, I'm also working from some of the code in this guide (the guide itself is an HTML page).
# In[29]:
import pandas as pd

def dl_by_loc(path):
    endname = "USA_downloads.csv"
    with open(path + endname, "r") as f:
        data = pd.read_csv(f)
    data.columns = ["date", "city", "coords", "doi", "latitude", "longitude",
                    "round_latitude", "round_longitude"]
    # Count rows (downloads) per rounded coordinate pair and city
    data = data.groupby(["round_latitude", "round_longitude", "city"]).count()
    data = data.rename(columns={"date": "downloads"})
    return data["downloads"]
# In[30]:
downloads_by_coords = dl_by_loc(path)
len(downloads_by_coords)
# In[31]:
downloads_by_coords = downloads_by_coords.reset_index()
downloads_by_coords.columns = ["round_latitude","round_longitude","city","downloads"]
# In[32]:
downloads_by_coords.head()
# In[38]:
by_coords = downloads_by_coords.reset_index()
coord_dupes = by_coords[by_coords.duplicated(subset=["round_latitude","round_longitude","city"])]
coord_dupes
Here are a few lines from the data, as requested:
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1042/BJ20091140,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1096/fj.05-5309fje,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1186/1478-811X-11-15,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:21,Houston,"29.7604267,-95.3698028",10.1039/P19730002379,29.7604267,-95.36980279999999,30.0,-95.0
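In case a runnable snippet helps, this is how I'd read just those four sample lines into a frame with the same column names (a sketch for reproduction only; I'm passing header=None here on the assumption that the file has no header row):

from io import StringIO
import pandas as pd

# The four sample rows above, pasted verbatim
sample = StringIO("""\
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1042/BJ20091140,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1096/fj.05-5309fje,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1186/1478-811X-11-15,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:21,Houston,"29.7604267,-95.3698028",10.1039/P19730002379,29.7604267,-95.36980279999999,30.0,-95.0
""")

cols = ["date", "city", "coords", "doi", "latitude", "longitude",
        "round_latitude", "round_longitude"]
sample_df = pd.read_csv(sample, header=None, names=cols)
print(sample_df[["city", "round_latitude", "round_longitude"]])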