Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

awk - Remove words found in the second file

I need to compare two files and remove from the text file every word that appears in the second (exclude-list) file.

# cat remove.txt
test
junk
trash
unwanted
bad
worse

# cat corpus.txt
this is a test message to check if bad words are removed correctly. The second line may or may not have unwanted words. The third line also need not be as clean as first and second line.
There can be paragraphs in the text corpus and the entire file should be checked for trash.

This Python code works as expected.

import re

stop_words = list()
with open("remove.txt", "r") as f:
    for i in f.readlines():
        stop_words.append(i.replace("\n", ""))

# !> filteredtext.txt

file1 = open("corpus.txt")

line = file1.read()
words = line.split()

for r in words:
    # Strip punctuation: anything that is neither a word character nor whitespace.
    r = re.sub(r"[^\w\s]", "", r)
    if r not in stop_words:
        appendFile = open("filteredtext.txt", "a")
        appendFile.write(" " + r)
        appendFile.close()

I would like to know whether some Linux command-line magic is possible in this case. The regular expression in the Python code is optional. The cleaned text need not be 100% clean; more than 90% accuracy is OK.

Expected output:

 this is a message to check if words are removed correctly The second line may or may not have words The third line also need not be as clean as first and second line There can be paragraphs in the text corpus and the entire file should be checked for
question from:https://stackoverflow.com/questions/65895621/remove-words-found-in-the-second-file

1 Reply

You may use this GNU awk command:

awk -v RS='[[:space:]]+' 'FNR == NR {seen[$1]; next} !($1 in seen) {ORS=RT; print}' remove.txt corpus.txt

With a 450 MB remove.txt file, the awk command above took 1 min 16 sec to complete.

To make it more readable:

awk -v RS='[[:space:]]+' 'FNR == NR {
   seen[$1]
   next
}
!($1 in seen) {
   ORS = RT
   print
}' remove.txt corpus.txt
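
For readers less familiar with awk, the same logic can be sketched in Python. This is only an illustration under assumptions, not part of the original answer: the function name remove_words is hypothetical, and the capturing re.split stands in for gawk's regex RS and its RT variable. Like the awk command, it compares whole whitespace-delimited tokens, so a word with punctuation attached (such as "trash.") is kept.

```python
import re


def remove_words(text, stop_words):
    """Drop whitespace-delimited tokens that exactly match a stop word.

    Hypothetical helper mirroring the awk command: each surviving token is
    written back together with the whitespace that followed it.
    """
    seen = set(stop_words)  # plays the role of awk's seen[] array

    # A capturing group makes re.split keep the separators, so tokens sit at
    # even indices and the whitespace that followed them at odd indices.
    parts = re.split(r"(\s+)", text)

    out = []
    for i in range(0, len(parts), 2):
        token = parts[i]
        sep = parts[i + 1] if i + 1 < len(parts) else ""
        if token not in seen:
            out.append(token + sep)
    return "".join(out)


if __name__ == "__main__":
    sample = "this is a test message\nchecked for trash.\n"
    # "test" is removed; "trash." survives because of the trailing period.
    print(remove_words(sample, ["test", "trash"]), end="")
```

Because membership is checked against a set, lookup cost per token does not grow with the size of the exclude list, which is also what makes the awk array approach practical for large inputs.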

Earlier solution, using a single GNU sed script:

sed -f <(sed 's~.*~s/ *\\<&\\> *//~' remove.txt) corpus.txt

This produces:

this is amessage to check ifwords are removed correctly. The second line may or may not havewords. The third line also need not be as clean as first and second line.
There can be paragraphs in the text corpus and the entire file should be checked for.
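
The sed output joins neighbouring words because the pattern consumes the spaces on both sides of each match. A minimal Python sketch of the same whole-word idea that removes only the leading whitespace is shown here. It is an assumption-laden illustration, not the answerer's method: strip_stop_words is a hypothetical name, and unlike the sed script (which has no g flag) it removes every occurrence on a line.

```python
import re


def strip_stop_words(line, stop_words):
    """Remove whole-word matches plus the whitespace in front of them.

    Hypothetical helper: leaving the trailing space in place keeps the
    neighbouring words apart.
    """
    if not stop_words:
        return line  # an empty alternation would match everywhere

    # re.escape guards against regex metacharacters in the exclude list;
    # \b restricts matches to whole words, similar to \< and \> in GNU sed.
    alternation = "|".join(re.escape(w) for w in stop_words)
    pattern = re.compile(r"\s*\b(?:" + alternation + r")\b")
    return pattern.sub("", line)


if __name__ == "__main__":
    text = "check if bad words are removed. checked for trash."
    print(strip_stop_words(text, ["bad", "trash"]))
    # prints: check if words are removed. checked for.
```

Compiling one alternation is convenient for a short exclude list; for a very large list, the set-based token lookup used by the awk command is likely the better fit.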
