scripting - How to handle commas within a CSV file being read by bash script

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'm creating a bash script to generate some output from a CSV file (I have over 1000 entries and don't fancy doing it by hand...).

The content of the CSV file looks similar to this:

Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation

I have some code that can separate the fields using the comma as delimiter, but some values actually contain commas, such as Adygeya, Republic. These values are surrounded by quotes to indicate the characters within should be treated as part of the field, but I don't know how to parse it to take this into account.

Currently I have this loop:

while IFS=, read province provinceCode criteriaId countryCode country
do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < $input

which produces this output for the sample data given above:

[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
["Adygeya] [ Republic"] [RU-AD] [21250] [RU,Russian Federation]

As you can see, the third entry is parsed incorrectly. I want it to output

[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]


1 Reply

深蓝 · Answer 1 · 2021-10-23T17:59:18+0000

If you want to do it all in awk (GNU awk 4 or later is required, because the script relies on the FPAT variable):

awk '{
 for (i = 0; ++i <= NF;) {
   # strip the surrounding quotes from a quoted field
   substr($i, 1, 1) == "\"" &&
     $i = substr($i, 2, length($i) - 2)
   printf "[%s]%s", $i, (i < NF ? OFS : RS)
    }
 }' FPAT='([^,]+)|("[^"]+")' infile

Sample output:

% cat infile
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
% awk '{
 for (i = 0; ++i <= NF;) {
   substr($i, 1, 1) == "\"" &&
     $i = substr($i, 2, length($i) - 2)
   printf "[%s]%s", $i, (i < NF ? OFS : RS)
    }
 }' FPAT='([^,]+)|("[^"]+")' infile
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]

With Perl:

perl -MText::ParseWords -lne'
 print join " ", map "[$_]", 
   parse_line(",",0, $_);
  ' infile

This should work with your awk version, since it does not rely on FPAT (based on a post in comp.unix.shell; it also replaces the embedded commas with spaces):

awk '{
 n = parse_csv($0, data)
 for (i = 0; ++i <= n;) {
    gsub(/,/, " ", data[i])
    printf "[%s]%s", data[i], (i < n ? OFS : RS)
    }
  }
function parse_csv(str, array,   field, i) { 
  split( "", array )
  str = str ","
  while ( match(str, /[ ]*("[^"]*(""[^"]*)*"|[^,]*)[ ]*,/) ) { 
    field = substr(str, 1, RLENGTH)
    gsub(/^[ ]*"?|"?[ ]*,$/, "", field)
    gsub(/""/, "\"", field)
    array[++i] = field
    str = substr(str, RLENGTH + 1)
  }
  return i
}' infile
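
To get back to the while/read loop from the question, one option is to let awk re-emit every record with a delimiter that cannot occur in the data and then split on that delimiter with IFS. The following is only a minimal sketch and not part of the original answer: the function names csv_to_delimited and print_fields are hypothetical, the choice of | as the replacement delimiter is an assumption, and it presumes that no field contains | or a newline.

```shell
# Hypothetical helper: re-emit CSV records with "|" between the fields and
# the surrounding quotes stripped. Assumes no field contains "|" or a newline.
csv_to_delimited() {
  awk '
  {
    n = parse_csv($0, data)
    for (i = 1; i <= n; i++)
      printf "%s%s", data[i], (i < n ? "|" : "\n")
  }
  # Same parser as in the script above: fills array, returns the field count.
  function parse_csv(str, array,   field, i) {
    split("", array)
    str = str ","
    while (match(str, /[ ]*("[^"]*(""[^"]*)*"|[^,]*)[ ]*,/)) {
      field = substr(str, 1, RLENGTH)
      gsub(/^[ ]*"?|"?[ ]*,$/, "", field)
      gsub(/""/, "\"", field)
      array[++i] = field
      str = substr(str, RLENGTH + 1)
    }
    return i
  }' "$@"
}

# Hypothetical wrapper around the loop from the question, now splitting on "|".
print_fields() {
  csv_to_delimited "$@" |
  while IFS='|' read -r province provinceCode criteriaId countryCode country
  do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
  done
}
```

Called as print_fields "$input", this keeps the comma inside a quoted field, so the third sample record starts with [Adygeya, Republic]. If the commas should go away as in the desired output, adding gsub(/,/, "", data[i]) before the printf in the awk part would be one way to drop them.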
