
python - PySpark - String matching to create new column

I have a dataframe like:

ID      Notes
2345    Checked by John
2398    Verified by Stacy
3983    Double Checked on 2/23/17 by Marsha

Let's say for example there are only 3 employees to check: John, Stacy, or Marsha. I'd like to make a new column like so:

ID      Notes                                   Employee
2345    Checked by John                         John
2398    Verified by Stacy                       Stacy
3983    Double Checked on 2/23/17 by Marsha     Marsha

Is regex or grep better here? What kind of function should I try? Thanks!

EDIT: I've been trying a bunch of solutions, but nothing seems to work. Should I give up and instead create a column for each employee with a binary value? i.e.:

ID      Notes                                   John    Stacy   Marsha
2345    Checked by John                         1       0       0
2398    Verified by Stacy                       0       1       0
3983    Double Checked on 2/23/17 by Marsha     0       0       1
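
For reference, a minimal sketch of that indicator-column alternative (assuming the dataframe above is called df and the list of employees is known in advance; the names and loop below are illustrative, not part of the original question):

from pyspark.sql.functions import col, when

# assumed, fixed list of employees to check against
employees = ['John', 'Stacy', 'Marsha']

# add one 0/1 indicator column per employee, based on a substring match in Notes
for name in employees:
    df = df.withColumn(name, when(col('Notes').contains(name), 1).otherwise(0))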

1 Reply


In short:

regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4)

This expression extracts the employee name from the text column (col('Notes')) wherever it appears after the word by followed by one or more spaces.


In Detail:

Create a sample dataframe

data = [('2345', 'Checked by John'),
        ('2398', 'Verified by Stacy'),
        ('2328', 'Verified by Srinivas than some random text'),
        ('3983', 'Double Checked on 2/23/17 by Marsha')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
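
As a side note, on newer PySpark versions the same dataframe is commonly built through the SparkSession instead of the RDD API; a minimal equivalent sketch (assuming an active session named spark):

# equivalent construction via SparkSession
df = spark.createDataFrame(data, ['ID', 'Notes'])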

Do the needed imports

from pyspark.sql.functions import regexp_extract, col

On df, extract the employee name from the Notes column using regexp_extract(column_name, regex, group_number).

Here the regex '(.)(by)(\s+)(\w+)' means:

  • (.) - any single character (except a newline)
  • (by) - the literal word by in the text
  • (\s+) - one or more whitespace characters
  • (\w+) - one or more word characters (letters, digits, or underscore)

and group_number is 4 because the group (\w+) is the 4th group in the expression.

result = df.withColumn('Employee', regexp_extract(col('Notes'), '(.)(by)(\s+)(\w+)', 4))

result.show()

+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+
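
Note that this pattern captures whatever word follows by, which is why Srinivas is also extracted. If you only want the three employees from the question, one possible variation (an illustrative sketch, not part of the original answer) is to alternate the known names inside the capture group; rows mentioning any other name then get an empty string in Employee:

# capture only one of the three known employee names appearing after 'by'
result = df.withColumn('Employee',
                       regexp_extract(col('Notes'), r'by\s+(John|Stacy|Marsha)', 1))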

Databricks notebook

Note:

regexp_extract(col('Notes'), '.by\s+(\w+)', 1) is a much cleaner version; check the regex in use here.
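
Applied to the same dataframe, that shorter pattern might be used like this (the raw-string r'...' form is only to keep the backslashes literal):

# same extraction with the simpler pattern: capture the word following 'by'
result = df.withColumn('Employee',
                       regexp_extract(col('Notes'), r'.by\s+(\w+)', 1))
result.show()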

