I want to find all of the rows that are duplicated across the columns city, round_latitude, and round_longitude. So, if two rows share the same values in each of those columns, both of them should be returned.
I'm not exactly sure what is going on here: I'm certain that there are duplicates in the dataset, yet running In[38] raises no error, and the result shows the column names but no rows. What am I doing wrong, and how can I fix it?
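For reference, this is roughly the behaviour I'm expecting from duplicated, sketched on a made-up frame (not my real data; the downloads numbers are invented just for illustration):

import pandas as pd

# Tiny made-up frame: two rows share the same city / rounded coordinates
toy = pd.DataFrame({
    "city": ["Philadelphia", "Philadelphia", "Houston"],
    "round_latitude": [40.0, 40.0, 30.0],
    "round_longitude": [-75.0, -75.0, -95.0],
    "downloads": [3, 5, 1],
})

# keep=False flags every member of a duplicated group, not only the later occurrences
dupes = toy[toy.duplicated(subset=["city", "round_latitude", "round_longitude"], keep=False)]
print(dupes)  # I would expect both Philadelphia rows back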
If it helps, I'm also working from some of the code in this guide (the guide itself is an HTML page).
# In[29]:
import pandas as pd

def dl_by_loc(path):
    endname = "USA_downloads.csv"
    with open(path + endname, "r") as f:
        data = pd.read_csv(f)
    data.columns = ["date", "city", "coords", "doi", "latitude", "longitude",
                    "round_latitude", "round_longitude"]
    # Count rows (downloads) per rounded coordinate pair and city
    data = data.groupby(["round_latitude", "round_longitude", "city"]).count()
    data = data.rename(columns={"date": "downloads"})
    return data["downloads"]
# In[30]:
downloads_by_coords = dl_by_loc(path)
len(downloads_by_coords)
# In[31]:
downloads_by_coords = downloads_by_coords.reset_index()
downloads_by_coords.columns = ["round_latitude","round_longitude","city","downloads"]
# In[32]:
downloads_by_coords.head()
# In[38]:
by_coords = downloads_by_coords.reset_index()
coord_dupes = by_coords[by_coords.duplicated(subset=["round_latitude","round_longitude","city"])]
coord_dupes
Here are a few lines from the data, as requested:
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1042/BJ20091140,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1096/fj.05-5309fje,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1186/1478-811X-11-15,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:21,Houston,"29.7604267,-95.3698028",10.1039/P19730002379,29.7604267,-95.36980279999999,30.0,-95.0
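In case a runnable snippet helps, this is how I'd read just those four sample lines into a frame with the same column names (a sketch for reproduction only; I'm passing header=None here on the assumption that the file has no header row):

from io import StringIO
import pandas as pd

# The four sample rows above, pasted verbatim
sample = StringIO("""\
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1042/BJ20091140,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1096/fj.05-5309fje,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1186/1478-811X-11-15,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:21,Houston,"29.7604267,-95.3698028",10.1039/P19730002379,29.7604267,-95.36980279999999,30.0,-95.0
""")

cols = ["date", "city", "coords", "doi", "latitude", "longitude",
        "round_latitude", "round_longitude"]
sample_df = pd.read_csv(sample, header=None, names=cols)
print(sample_df[["city", "round_latitude", "round_longitude"]])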