Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
174 views
in Technique by (71.8m points)

If-else in Spark: passing a condition to find values from a CSV file

I want to read a CSV file into dfTRUEcsv.

In the example below, how do I get the values (03, 05) and 11 as strings? I want to pass those strings as parameters to pick up files from the matching folders.

I will pass (03, 05) and 11 as parameters:

    if TRUE, loop over Folder3 and Folder5;
    otherwise, use Folder11

+-------------+--------------+--------------------+-----------------+--------+
|Calendar_year|Calendar_month|EDAP_Data_Load_Statu|lake_refined_date|isreload|
+-------------+--------------+--------------------+-----------------+--------+
|         2019|             2|                HIST|         20190829|   FALSE|
|         2019|             3|                HIST|         20190829|    TRUE|
|         2019|             4|                HIST|         20190829|   FALSE|
|         2019|             5|                HIST|         20190829|    TRUE|
|         2019|            11|                HIST|         20190829|   FALSE|
+-------------+--------------+--------------------+-----------------+--------+

       if the file has rows with isreload == 'TRUE'
           var Foldercolumn = Calendar_month   // one value per TRUE row
           Foldercolumn = 03
           Foldercolumn = 05
       else
           var Foldercolumn = max(Calendar_year), max(Calendar_month)
           Foldercolumn = 11
       end if
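The branching above can be sketched in plain Scala, independent of Spark. The `LoadRow` case class and the sample data are hypothetical stand-ins for the CSV columns, added only for illustration:

```scala
// Minimal sketch of the if/else logic, assuming a hypothetical LoadRow
// case class mirroring the control file's columns.
case class LoadRow(year: Int, month: Int, isreload: Boolean)

val rows = List(
  LoadRow(2019, 2, false), LoadRow(2019, 3, true),
  LoadRow(2019, 4, false), LoadRow(2019, 5, true),
  LoadRow(2019, 11, false)
)

// If any row has isreload == true, take the months of those rows;
// otherwise fall back to the month of the max (year, month) pair.
// Tuple comparison is lexicographic, so .max compares year first.
val folderMonths: List[Int] =
  if (rows.exists(_.isreload)) rows.filter(_.isreload).map(_.month)
  else List(rows.map(r => (r.year, r.month)).max._2)

println(folderMonths)  // List(3, 5) for this sample data
```

With no TRUE rows the same expression would yield `List(11)`, matching the else branch of the pseudocode.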

Below is my Spark code for the above requirement:

val destinationContainerPath= "Finance/Data"
val dfCSVLogs = readCSV(s"$destinationContainerPath/sourcecsv.csv")

val dfTRUEcsv = dfCSVLogs.filter("isreload == 'TRUE'")

1 Reply

0 votes
by (71.8m points)
//read input control CSV file 
    scala> val df = spark.read.format("csv").option("header", "true").load("file.csv")
    scala> df.show(false)
    +-------------+--------------+--------------------+-----------------+--------+
    |Calendar_year|Calendar_month|EDAP_Data_Load_Statu|lake_refined_date|isreload|
    +-------------+--------------+--------------------+-----------------+--------+
    |2018         |12            |HIST                |20190829         |FALSE   |
    |2019         |2             |HIST                |20190829         |FALSE   |
    |2019         |3             |HIST                |20190829         |TRUE    |
    |2019         |4             |HIST                |20190829         |FALSE   |
    |2019         |11            |HIST                |20190829         |FALSE   |
    |2019         |5             |HIST                |20190829         |TRUE    |
    +-------------+--------------+--------------------+-----------------+--------+
    // initialize a variable holding the max year/month
    // note: the expression below can be adapted to your requirement; simply add a filter to take the max under a particular condition

    scala> val maxYearMonth = df
         |   .select(struct(col("Calendar_year").cast("Int"), col("Calendar_month").cast("Int")) as "ym")
         |   .agg(max("ym") as "max")
         |   .selectExpr("stack(1, max.col1, max.col2) as (year, month)")
         |   .select(concat(col("year"), lit("/"), col("month")))
         |   .rdd.collect.map(r => r(0)).mkString
    maxYearMonth: String = 2019/11
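The `struct`/`max` trick works because Spark compares structs field by field, just like Scala's tuple ordering: years are compared first, then months. A plain-Scala sketch of the same idea, with sample pairs assumed from the control file above:

```scala
// (year, month) pairs mirroring the control file; tuple .max is
// lexicographic, so the latest year wins, then the latest month.
val ym = List((2018, 12), (2019, 2), (2019, 3), (2019, 4), (2019, 11), (2019, 5))
val (maxYear, maxMonth) = ym.max
val maxYearMonth = s"$maxYear/$maxMonth"
println(maxYearMonth)  // 2019/11
```

Note that the cast to `Int` in the Spark version matters for the same reason: comparing months as strings would rank "5" above "11".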

    // add a temporary strFoldercolumn column to the input DataFrame
    scala> val df2 = df.withColumn("strFoldercolumn", when(col("isreload") === "TRUE", concat(col("Calendar_year"), lit("/"),col("Calendar_month"))).otherwise(lit(maxYearMonth)))
    scala> df2.show(false)
    +-------------+--------------+--------------------+-----------------+--------+---------------+
    |Calendar_year|Calendar_month|EDAP_Data_Load_Statu|lake_refined_date|isreload|strFoldercolumn|
    +-------------+--------------+--------------------+-----------------+--------+---------------+
    |2018         |12            |HIST                |20190829         |FALSE   |2019/11        |
    |2019         |2             |HIST                |20190829         |FALSE   |2019/11        |
    |2019         |3             |HIST                |20190829         |TRUE    |2019/3         |
    |2019         |4             |HIST                |20190829         |FALSE   |2019/11        |
    |2019         |11            |HIST                |20190829         |FALSE   |2019/11        |
    |2019         |5             |HIST                |20190829         |TRUE    |2019/5         |
    +-------------+--------------+--------------------+-----------------+--------+---------------+


    // collect the distinct strFoldercolumn values into a list of plain strings
    // (use getString(0) rather than Row.toString, which would add brackets like [2019/5])
    scala> val strFoldercolumn = df2.select("strFoldercolumn").distinct.collect.map(_.getString(0)).toList
    strFoldercolumn: List[String] = List(2019/5, 2019/11, 2019/3)

    // looping over each value
    scala> strFoldercolumn.foreach { x =>
         |   val csvPath = "folder/" + x.toString + "/*.csv"
         |   val srcdf = spark.read.format("csv").option("header", "true").load(csvPath)
         |   // write logic here to copy srcdf to your destination folder
         | }
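For reference, the glob paths the loop builds look like this (a plain-Scala sketch; the `folder/` prefix is assumed from the question, not a real path):

```scala
// Build one glob path per collected year/month value.
val folders = List("2019/3", "2019/5", "2019/11")
val csvPaths = folders.map(f => "folder/" + f + "/*.csv")
println(csvPaths)  // List(folder/2019/3/*.csv, folder/2019/5/*.csv, folder/2019/11/*.csv)
```

If the collected list still contains Spark `Row` objects rather than plain strings, `x.toString` yields bracketed values like `[2019/5]` and corrupts the path; extract the value with `row.getString(0)` first.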
