
python - PySpark - String matching to create new column

I have a dataframe like:

ID      Notes
2345    Checked by John
2398    Verified by Stacy
3983    Double Checked on 2/23/17 by Marsha

Let's say for example there are only 3 employees to check: John, Stacy, or Marsha. I'd like to make a new column like so:

ID      Notes                                   Employee
2345    Checked by John                         John
2398    Verified by Stacy                       Stacy
3983    Double Checked on 2/23/17 by Marsha     Marsha

Is regex or grep better here? What kind of function should I try? Thanks!

EDIT: I've been trying a bunch of solutions, but nothing seems to work. Should I give up and instead create a column for each employee with a binary value? i.e.:

ID      Notes                                   John    Stacy   Marsha
2345    Checked by John                         1       0       0
2398    Verified by Stacy                       0       1       0
3983    Double Checked on 2/23/17 by Marsha     0       0       1
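
For reference, a minimal sketch of that indicator-column alternative (assuming the dataframe above is called df and the list of employees is known in advance; the names and loop below are illustrative, not part of the original question):

from pyspark.sql.functions import col, when

# assumed, fixed list of employees to check against
employees = ['John', 'Stacy', 'Marsha']

# add one 0/1 indicator column per employee, based on a substring match in Notes
for name in employees:
    df = df.withColumn(name, when(col('Notes').contains(name), 1).otherwise(0))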

1 Reply


In short:

regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4)

This expression extracts the employee name from the text column (col('Notes')) wherever it appears after the word by followed by one or more spaces.


In Detail:

Create a sample dataframe

data = [('2345', 'Checked by John'),
        ('2398', 'Verified by Stacy'),
        ('2328', 'Verified by Srinivas than some random text'),
        ('3983', 'Double Checked on 2/23/17 by Marsha')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
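
As a side note, on newer PySpark versions the same dataframe is commonly built through the SparkSession instead of the RDD API; a minimal equivalent sketch (assuming an active session named spark):

# equivalent construction via SparkSession
df = spark.createDataFrame(data, ['ID', 'Notes'])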

Do the needed imports

from pyspark.sql.functions import regexp_extract, col

On df, extract the employee name from the Notes column using regexp_extract(column_name, regex, group_number).

Here the regex '(.)(by)(\s+)(\w+)' means:

  • (.) - any single character (except a newline)
  • (by) - the literal word by in the text
  • (\s+) - one or more whitespace characters
  • (\w+) - one or more word characters (letters, digits, or underscore)

and group_number is 4 because the group (\w+) is the 4th group in the expression.

result = df.withColumn('Employee', regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4))

result.show()

+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+
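
Note that this pattern captures whatever word follows by, which is why Srinivas is also extracted. If you only want the three employees from the question, one possible variation (an illustrative sketch, not part of the original answer) is to alternate the known names inside the capture group; rows mentioning any other name then get an empty string in Employee:

# capture only one of the three known employee names appearing after 'by'
result = df.withColumn('Employee',
                       regexp_extract(col('Notes'), r'by\s+(John|Stacy|Marsha)', 1))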

Databricks notebook

Note:

regexp_extract(col('Notes'), '.by\s+(\w+)', 1) is a much cleaner version; check the regex in use here.
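
Applied to the same dataframe, that shorter pattern might be used like this (the raw-string r'...' form is only to keep the backslashes literal):

# same extraction with the simpler pattern: capture the word following 'by'
result = df.withColumn('Employee',
                       regexp_extract(col('Notes'), r'.by\s+(\w+)', 1))
result.show()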

