Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


awk - Bash: Find and replace lines in a file using the lines of another file

I have two files: masterlist.txt that has hundreds of lines of URLs, and toupdate.txt that has a smaller number of updated versions of lines from the masterlist.txt file that need to be replaced.

I'd like to be able to automate this process using Bash, since the creation and utilisation of these lists is already occurring in a bash script.

The server part of the URL is the part that changes, so we could match on the unique ending, /whatever/whatever_user.xml. But how do I find and replace those lines in masterlist.txt? That is, for each line of toupdate.txt ending in, say, /f_SomeName/f_SomeName_user.xml, how do I find the line with that ending in masterlist.txt and replace the whole line with the new one?

So https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml becomes https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml for example.

The rest of masterlist.txt needs to stay intact, so we must only find and replace lines that have different servers for the same line endings (IDs).

Structure

masterlist.txt looks like this:

https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://101112url.domain.com/1/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
[...]

toupdate.txt looks like this:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml

Desired Result

Make masterlist.txt look like:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
[...]

Initial workup

I've looked at sed, but I don't know how to do the find and replace using lines from the two files.

Here's what I have so far, doing the file handling at least:

#!/bin/bash

#...

while read -r line; do
    # there's a new link on each line
    link="${line}"
    # extract the unique part from the end of each line
    grabXML="${link##*/}"
    grabID="${grabXML%_user.xml}"
    # if we cannot grab the ID, then just set it to use the full link so we don't have an empty string
    if [ -n "${grabID}" ]; then
        identifier=${grabID}
    else
        identifier="${line}"
    fi
    
    ## the find and replace here? ##    

# we're done when we've reached the end of the file
done < "masterlist.txt"
Question from: https://stackoverflow.com/questions/65862573/bash-find-and-replace-lines-in-a-file-using-the-lines-of-another-file


1 Reply


Would you please try the following:

#!/bin/bash

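declare -A map                  # associative arrays require bash 4.0 or later
# trailing "/dir/file.xml"; the dot is escaped so it matches only a literal dot
re='(/[^/]+/[^/]*\.xml)$'

# first pass: map each unique part to its updated URL
while IFS= read -r line; do
    if [[ $line =~ $re ]]; then
        uniq_part="${BASH_REMATCH[1]}"
        map[$uniq_part]=$line
    fi
done < "toupdate.txt"

# second pass: print each master line, substituting the updated URL if one exists
while IFS= read -r line; do
    if [[ $line =~ $re ]]; then
        uniq_part="${BASH_REMATCH[1]}"
        if [[ -n ${map[$uniq_part]} ]]; then
            line=${map[$uniq_part]}
        fi
    fi
    printf '%s\n' "$line"
done < "masterlist.txt" > "masterlist_tmp.txt"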

# if the result of "masterlist_tmp.txt" is good enough, uncomment the line below
# mv -f -- "masterlist_tmp.txt" "masterlist.txt"

result:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
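
Before uncommenting the mv line in the script above, it may be worth checking the temporary file against the original. The following is only a sketch of one way to do that, not part of the original answer: the function name replace_if_sane is hypothetical, and the line-count comparison is an assumption about what "good enough" means here (the update step replaces lines one for one, so the counts should match).

```shell
#!/usr/bin/env bash

# replace_if_sane is a hypothetical helper: $1 = original file, $2 = rewritten copy.
# It moves the copy over the original only when both files have the same
# number of lines; otherwise it leaves both files untouched and returns 1.
replace_if_sane() {
    local orig=$1 tmp=$2
    if [[ $(wc -l < "$orig") -eq $(wc -l < "$tmp") ]]; then
        diff "$orig" "$tmp"         # show the changed lines for review
        mv -f -- "$tmp" "$orig"
    else
        echo "line counts differ; keeping $orig untouched" >&2
        return 1
    fi
}

# Usage, assuming the file names from the script above:
#   replace_if_sane "masterlist.txt" "masterlist_tmp.txt"
```

The diff output lists each replaced line next to its replacement, so an unexpected substitution is easy to spot.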

[Explanations]

  • The associative array map maps the "unique part", such as /f_SomeName/f_SomeName_user.xml, to the "full path", such as https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml.
  • The regex, if matched, stores in BASH_REMATCH[1] the substring from the second-to-last slash through the extension ".xml" at the end of the string.
  • The first loop, over the file "toupdate.txt", stores "unique part" and "full path" pairs as key-value pairs of the associative array.
  • The second loop, over the file "masterlist.txt", extracts the "unique part" of each line and tests whether an associated value exists. If so, the line is replaced with that value, which is the corresponding line from "toupdate.txt".
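
To see what the regex captures, the key extraction can be tried on its own. This is a minimal sketch rather than part of the original answer: the function name unique_part is hypothetical, and the dot before xml is escaped here so that it matches only a literal dot.

```shell
#!/usr/bin/env bash

# unique_part is a hypothetical helper name used only for this illustration.
# It prints the trailing "/dir/file.xml" portion of its argument, or
# returns 1 if the argument does not end in ".xml".
unique_part() {
    local re='(/[^/]+/[^/]*\.xml)$'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

old='https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml'
new='https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml'

# Both URLs yield the same key even though the server part differs,
# which is what lets the key be used as the array index.
unique_part "$old"    # prints /f_SomeName/f_SomeName_user.xml
unique_part "$new"    # prints /f_SomeName/f_SomeName_user.xml
```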

[Alternative]
If the text files are large, bash may not be fast enough. In that case, an awk script will be more efficient:

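awk 'NR==FNR {
    # first file (toupdate.txt): map each unique part to its updated URL
    # "\\." in the string becomes the regex "\.", i.e. a literal dot
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        map[substr($0, RSTART, RLENGTH)] = $0
    }
    next
}
{
    # second file (masterlist.txt): substitute the updated URL if one exists
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        full_path = map[substr($0, RSTART, RLENGTH)]
        if (full_path != "") {
            $0 = full_path
        }
    }
    print
}' "toupdate.txt" "masterlist.txt" > "masterlist_tmp.txt"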

[Explanations]

  • The NR==FNR { BLOCK1; next } { BLOCK2 } syntax is a common idiom for processing each file separately. Because the NR==FNR condition holds only for the first file in the argument list, and the next statement skips the following block, BLOCK1 processes only "toupdate.txt". Likewise, BLOCK2 processes only "masterlist.txt".
  • If match($0, pattern) succeeds, it sets the awk variable RSTART to the start position of the matched substring within $0 (the current record read from the file) and RLENGTH to the length of that substring. We can then extract the matched substring, such as /f_SomeName/f_SomeName_user.xml, with the substr() function.
  • We then store an entry in the array map so that the substring (the unique part) maps to the whole URL from "toupdate.txt".
  • The second block works much like the first. If a value is found in map for the key, the record ($0) is replaced with that value.
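
One caveat with the NR==FNR idiom, not mentioned in the original answer: if the first file is empty, NR==FNR also holds for every line of the second file, so nothing is printed and the output file ends up empty. A possible variant, sketched below, compares FILENAME with ARGV[1] instead. The wrapper name apply_updates and the temporary files are hypothetical and exist only for this demonstration.

```shell
#!/usr/bin/env bash

# apply_updates is a hypothetical wrapper: $1 = update file, $2 = master file.
# It writes the updated master list to standard output.
apply_updates() {
    awk '
    FILENAME == ARGV[1] {       # true only while reading the update file
        if (match($0, "/[^/]+/[^/]*\\.xml$"))
            map[substr($0, RSTART, RLENGTH)] = $0
        next
    }
    {
        if (match($0, "/[^/]+/[^/]*\\.xml$")) {
            key = substr($0, RSTART, RLENGTH)
            if (key in map)     # test membership without creating an entry
                $0 = map[key]
        }
        print
    }' "$1" "$2"
}

# Demonstration with an empty update file: the master list passes through unchanged.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml' \
    'https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml' \
    > "$tmpdir/master.txt"
: > "$tmpdir/empty.txt"
apply_updates "$tmpdir/empty.txt" "$tmpdir/master.txt"
rm -rf "$tmpdir"
```

With a non-empty update file the behaviour should be the same as the NR==FNR version, so the choice only matters when "toupdate.txt" might be empty.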
