I have two data frames SF and OF.
SF:
PartNumber ParentPartNumber Webname Brand Value_Size Full Description ImagePath Short Description Weight RetailPriceEUR Isfeatured
2.5 2 Sidi Si S Honeycomb elastic https://link1,https://link2 Honey 2.3 331 yes
2.6 2 Sidi Si M Honeycomb elastic https://link1,https://link2 Honey 2.3 331 yes
2.7 2 Sidi Si L Honeycomb elastic https://link1,https://link2 Honey 2.3 331 yes
3.2 3 Shoei Sho S E.Q.R.S. https://link3 ERQS 1.5 331 yes
3.3 3 Shoei Sho M E.Q.R.S. https://link3 ERQS 1.5 331 yes
2.9 2 Sidi Si XL Honeycomb elastic https://link1,https://link2 Honey 2.3 331 yes
OF:
SKU Published Name Parent Size Full Description ImagePath ShortDescription Weight RetailPriceEUR Isfeatured height
4 1 Bec
8 1 Lin
What I want to do is to add an extra row before each duplicated row present in SF and append it to the OF data frame. For example, if there are duplicates in a parent like 2,2,3,3 the first row of 2 and the second row of 2 needs to be copied, and in addition to all rows, there have to be an extra one added before them with info as in the description. So the end result should look like
Result (SF rows appended in OF):
SKU Published Name Parent Size Full Description ImagePath ShortDescription Weight RetailPriceEUR Isfeatured height
4 1 Bec
8 1 Lin
2 1 Sidi S,M,L,XL Honeycomb elastic https://link1,https://link2 yes
2.5 0 Honey 2 S Honey 2.3 331 yes
2.6 0 Honey 2 M Honey 2.3 331 yes
2.7 0 Honey 2 L Honey 2.3 331 yes
2.9 0 Honey 2 XL Honey 2.3 331 yes
3 (extra) 1 Sho S,M E.Q.R.S. https://link3 yes
3 0 ERQS 3 S EQRS 1.5 33 yes
3 0 ERQS 3 M ERQS 1.5 33 yes
This is what I have done so far
# create a DataFrame keep SF all duplicated rows
cond = SF.duplicated(keep=False)
df = SF.loc[cond, ['PartNumber', 'ParentPartNumber', 'Webname' , 'Value_Size']].copy()
df['Published'] = 0
# rename columns, and sorted columns
df.columns = ['SKU', 'Parent', 'Name', 'Size', 'Published']
df = df[OF.columns].copy()
# drop duplicates(keep the first duplicate row), and assign 1 to Published column
dfn = df.drop_duplicates(keep='first').copy()
# get rows with duplicated ParentPartNumber
SFs = SFs[SFs['ParentPartNumber'].duplicated(keep = False)]
#group by size
gk =( SFs.groupby('ParentPartNumber'['Size'].agg(','.join).reindex(SFs['ParentPartNumber'].unique()).reset_indx())
# drop duplicates(keep the first duplicate row),
SFs = SFs.drop_duplicates(subset='ParentPartNumber', keep='first').copy()
dfn['SKU'] = SFs['ParentPartNumber'].values
dfn['Name'] = SFs['WebName'].values
dfn['Attribute 1 value(s)'] = gk['Size'].values
dfn['Published'] = 1
# append first duplicate row to df, and sort index to keep row order
dfn = dfn.append(df).sort_index()
# append the duplicates rows to OF
print(OF.append(dfn))
The issue is it works well with small sample size, however, for large files, it's not matching ParentPartNumber well and producing results where SKU = 3 should be below 3(extra) not above it
SKU Published Name Parent Size
4 1 Bec
8 1 Lin
2 1 Sidi S,M,L,XL
2.5 0 Honey 2 S
2.6 0 Honey 2 M
2.7 0 Honey 2 L
2.9 0 Honey 2 XL
*3 0 Sho 3 S*
3 (extra) 1 ERQS S,M
3 0 ERQS 3 M
EDIT:
I have a source file and have to perform some operations given below to obtain the dataframe shown above. Please note that some of the variables are missing from the sample file. The operations are:
Create new row in OF with values as follows
OF Type = variable
OF SKU = SF ParentPartNumber
OF Name = SF WebName
OF YMM = ,0,0,,
OF Published = 1
OF Is featured = 0
OF Visibility in catalog = visible
OF Description = SF FullDescription
OF Tax status = taxable
OF In stock? = 1
OF Backorders allowed? = 0
OF Sold individually? = 0
OF Allow customer reviews? = 1
OF Images = SF ImagePath (when there is more than 1 link in SF replace | with , (comma))
OF Position = 0
OF Attribute = Size
OF Attribute 1 value(s) = all values from SF Size that belong to the same ParentPartNumber separated with comma
OF Attribute 1 visible = 0
OF Attribute 1 global = 0
then below that row copy all rows that belong to this ParentPartNumber as follows
OF Type = variation
OF SKU = SF PartNumber
OF Name = SF ShortDescription
OF YMM = ,0,0,,
OF Published = 0
OF Is featured = 0
OF Visibility in catalog = visible
OF Tax status = taxable
OF Tax class = parent
OF In Stock = 1
OF Backorders allowed? = 0
OF Sold individually? = 0
OF Weight (kg) = SF Weight
OF Allow customer reviews? = 0
OF Regular price = SF RetailPriceEUR
OF Parent = SF ParentPartNumber
question from:
https://stackoverflow.com/questions/65546310/include-extra-row-before-duplicated-rows