Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - PySpark error: input doesn't have the expected number of values required by the schema, and extra trailing commas after columns

First I made two tables (RDDs) using the following commands:

rdd1=sc.textFile('checkouts').map(lambda line:line.split(',')).map(lambda fields:((fields[0],fields[3],fields[5]), 1) )
rdd2=sc.textFile('inventory2').map(lambda line:line.split(',')).map(lambda fields:((fields[0],fields[8],fields[10]), 1) )

The keys in the first RDD are BibNum, ItemCollection, and CheckoutDateTime. When I checked the values of the first RDD with rdd1.take(2), it showed:

[((u'BibNum', u'ItemCollection', u'CheckoutDateTime'), 1), ((u'1842225', u'namys', u'05/23/2005 03:20:00 PM'), 1)]

Similarly, the keys in the second RDD are BibNum, ItemCollection, and ItemLocation. The values are as follows:

[((u'BibNum', u'ItemCollection', u'ItemLocation'), 1), ((u'3011076', u'ncrdr', u'qna'), 1)]

Once I created the two RDDs, I tried to join them with rdd3=rdd1.join(rdd2). Then, when I checked the value of rdd3 with rdd3.take(2), the following error occurred:

IndexError: list index out of range

I do not know why this happened. Please enlighten me if you know the reason. If anything about my question or code is unclear, just let me know and I will try to clarify it. Thanks.
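For reference, this failure mode can be reproduced without Spark: if any line splits into fewer fields than the lambdas index (fields[8] and fields[10] in rdd2), plain Python raises exactly this error. A minimal sketch, with a hypothetical short line:

```python
# A line with fewer comma-separated values than the indices used in the
# rdd2 lambda reproduces the IndexError seen above.
line = "3011076,ncrdr,qna"          # hypothetical short/malformed line: 3 fields
fields = line.split(",")

try:
    key = (fields[0], fields[8], fields[10])   # same indices as the rdd2 lambda
except IndexError as err:
    print("IndexError:", err)                  # list index out of range
```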

Edit: I have put up my sample input data for each RDD. First, the checkouts file:

BibNum,ItemBarcode,ItemType,ItemCollection,CallNumber,CheckoutDateTime,,,,,,,
1842225,10035249209,acbk,namys,MYSTERY ELKINS1999,05/23/2005 03:20:00 PM,,,,,,,
1928264,10037335444,jcbk,ncpic,E TABACK,12/14/2005 05:56:00 PM,,,,,,,
1982511,10039952527,jcvhs,ncvidnf,VHS J796.2 KNOW_YO 2000,08/11/2005 01:52:00 PM,,,,,,,
2026467,10040985615,accd,nacd,CD 782.421642 Y71T,10/19/2005 07:47:00 PM,,,,,,,
2174698,10047696215,jcbk,ncpic,E KROSOCZ,12/29/2005 03:42:00 PM,,,,,,,
1602768,10028318730,jcbk,ncpic,E BLACK,10/08/2005 02:15:00 PM,,,,,,,
2285195,10053424767,accd,cacd,CD 782.42166 F19R,09/30/2005 10:16:00 AM,,,,Input,BinNumber,Date,BinNumber+Month
2245955,10048392665,jcbk,ncnf,J949.73 Or77S 2004,12/05/2005 05:03:00 PM,,,,,,,
770918,10044828100,jcbk,ncpic,E HILL,07/22/2005 03:17:00 PM,,,,,,,

And here is the sample for the second RDD (inventory2):

BibNum,Title,Author,ISBN,PublicationYear,Publisher,Subjects,ItemType,ItemCollection,FloatingItem,ItemLocation,ReportDate,ItemCount,,,,,,,,,,,,,
3011076,A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield| Frederick Gardner| Megan Petasky| and Allen Tam.,O'Ryan| Ellie,1481425730| 1481425749| 9781481425735| 9781481425742,2014,Simon Spotlight|,Musicians Fiction| Bullfighters Fiction| Best friends Fiction| Friendship Fiction| Adventure and adventurers Fiction,jcbk,ncrdr,Floating,qna,09/01/2017,1,,,,,,,,,,,,,
2248846,Naruto. Vol. 1| Uzumaki Naruto / story and art by Masashi Kishimoto ; [English adaptation by Jo Duffy].,Kishimoto| Masashi| 1974-,1569319006,2003| c1999.,Viz|,Ninja Japan Comic books strips etc| Comic books strips etc Japan Translations into English| Graphic novels,acbk,nycomic,NA,lcy,09/01/2017,1,,,,,,,,,,,,,
3209270,Peace| love & Wi-Fi : a ZITS treasury / by Jerry Scott and Jim Borgman.,Scott| Jerry| 1955-,144945867X| 9781449458676,2014,Andrews McMeel Publishing|,Duncan Jeremy Fictitious character Comic books strips etc| Teenagers United States Comic books strips etc| Parent and teenager Comic books strips etc| Families Comic books strips etc| Comic books strips etc| Comics Graphic works| Humorous comics,acbk,nycomic,NA,bea,09/01/2017,1,,,,,,,,,,,,,
1907265,The Paris pilgrims : a novel / Clancy Carlile.,Carlile| Clancy| 1930-,786706155,c1999.,Carroll & Graf|,Hemingway Ernest 1899 1961 Fiction| Biographical fiction| Historical fiction,acbk,cafic,NA,cen,09/01/2017,1,,,,,,,,,,,,,
1644616,Erotic by nature : a celebration of life| of love| and of our wonderful bodies / edited by David Steinberg.,,094020813X,1991| c1988.,Red Alder Books/Down There Press|,Erotic literature American| American literature 20th century,acbk,canf,NA,cen,09/01/2017,1,,,,,,,,,,,,,

Edit: date_count shows as DataFrame[BibNum: string, ItemCollection: string, CheckoutDateTime: string, count: bigint], but when I check its value using date_count.take(2), it shows an error like this: Input doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.
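For context, this kind of count mismatch is easy to reproduce in plain Python: if a row contains one more delimiter than the schema has columns, the split yields one extra value. The column names and row below are made up for illustration (a 4-column schema rather than 6, for brevity):

```python
# Hypothetical 4-column schema vs. a row with one stray trailing comma
schema = ["BibNum", "ItemCollection", "CheckoutDateTime", "count"]
row = "1842225,namys,05/23/2005 03:20:00 PM,5,"   # note the trailing comma

values = row.split(",")
print(len(schema), len(values))   # 4 columns expected, 5 values provided
```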

The df_final schema looks like this: DataFrame[BibNum:string, ItemType:string, ItemCollection:string, ItemBarcode:string, CallNumber:string, CheckoutDateTime:string, Title:string, Author:string, ISBN:string, PublicationYear:string, Publisher:string, Subjects:string, FloatingItem:string, ItemLocation:string, ReportDate:string, ItemLocation:string, :string, :string, :string ... :string, :string]



1 Reply


So I'll try to answer your question. The solution might be syntactically off, but I'll do my best (I don't have an environment to test right now). Let me know if this is what you are looking for; otherwise I can help you fine-tune the solution.

Here is the documentation for Join in Pyspark

So, when you read the files:

rdd1 = sc.textFile('checkouts').map(lambda line: line.split(','))
rdd2 = sc.textFile('inventory2').map(lambda line: line.split(','))

# Define the headers for both the files
rdd1_header = rdd1.first()
rdd2_header = rdd2.first()

# Define the DataFrames, dropping the header row from each RDD
rdd1_df = rdd1.filter(lambda line: line != rdd1_header).toDF(rdd1_header)
rdd2_df = rdd2.filter(lambda line: line != rdd2_header).toDF(rdd2_header)

# Join on ALL common columns so no ambiguous duplicate columns remain
common_cols = [x for x in rdd1_df.columns if x in rdd2_df.columns]
df_final = rdd1_df.join(rdd2_df, on=common_cols)

date_count = df_final.groupBy(["BibNum", "ItemCollection", "CheckoutDateTime"]).count()
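For clarity, the common_cols line is just a list intersection over the two headers; with column lists like the ones from your files (abbreviated here for illustration) it evaluates as follows:

```python
# Abbreviated header lists from the two files
rdd1_cols = ["BibNum", "ItemBarcode", "ItemType", "ItemCollection",
             "CallNumber", "CheckoutDateTime"]
rdd2_cols = ["BibNum", "Title", "Author", "ItemType", "ItemCollection",
             "FloatingItem", "ItemLocation"]

# Columns present in both headers, in rdd1's order
common_cols = [x for x in rdd1_cols if x in rdd2_cols]
print(common_cols)   # ['BibNum', 'ItemType', 'ItemCollection']
```

Passing this whole list to join's on= parameter is what prevents the "Reference 'ItemCollection' is ambiguous" error, since shared columns are merged instead of duplicated.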

EDITS :

1) Your error: "pyspark.sql.utils.AnalysisException: u"Reference 'ItemCollection' is ambiguous, could be ItemCollection#3, ItemCollection#21"" is due to duplicate columns being generated by the join. What you need to do is include all common columns in your join condition; I do this in the code above.

2) Another issue: some weird parts are appended to the end of each row, such as: [Row(BibNum=u'1842225', ItemBarcode=u'10035249209', ItemType=u'acbk', ItemCollection=u'namys', CallNumber=u'MYSTERY ELKINS1999', CheckoutDateTime=u'05/23/2005 03:20:00 PM', =u'', =u'', =u'', =u'', =u'', =u'', =u'')]

For this, you had shown your CSV file as follows:

BibNum,ItemBarcode,ItemType,ItemCollection,CallNumber,CheckoutDateTime,,,,,,,
1842225,10035249209,acbk,namys,MYSTERY ELKINS1999,05/23/2005 03:20:00 PM,,,,,,,
1928264,10037335444,jcbk,ncpic,E TABACK,12/14/2005 05:56:00 PM,,,,,,,

As you can see, there are a lot of trailing commas after the date column, i.e. ',,,,,,,'. After the split on commas, these produce the extra empty columns, which you can drop.
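One simple way to handle this (a sketch, assuming no row has a legitimate empty trailing field) is to strip the trailing commas from each line before splitting; in plain Python:

```python
# One of your sample checkouts lines, with its seven trailing commas
line = "1842225,10035249209,acbk,namys,MYSTERY ELKINS1999,05/23/2005 03:20:00 PM,,,,,,,"

fields = line.split(",")
print(len(fields))    # 13 -- six real values plus seven empty strings

# Strip the trailing commas first, then split
trimmed = line.rstrip(",").split(",")
print(trimmed[-1])    # '05/23/2005 03:20:00 PM'
print(len(trimmed))   # 6 -- only the real columns remain
```

In Spark this would be an extra map step before the split, e.g. sc.textFile('checkouts').map(lambda line: line.rstrip(',')).map(lambda line: line.split(',')). Note that at least one of your sample rows has non-empty values in those extra columns ('Input,BinNumber,Date,BinNumber+Month'), so you may want to inspect those rows before discarding the tail.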

