Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

io - Collecting strings from files and writing to an output file using Julia and Regex

(Julia and general programming newb)

I'm trying to read a directory full of JSON files containing lots of HTML pages (about 30), match short strings with a regex (many per file, up to 60k total), and write them to one big file, which I'll try to parse later so I can add the results to a MySQL DB.

Here's my code:

patFilename = r"[0-9]+_[0-9]+.json"
patID = r"/entry/[0-9]+/go"

filenames = readdir("C:/getentries/data/")

caseIDs = []

for filename in filenames
    if match(patFilename, filename) === nothing
        continue
    end

    file = open("C:/getentries/data/" * filename)
    case = read(file, String)

    push!(caseIDs, match(patID, case))

end

println(caseIDs)

touch("C:/getentries/data/caseIDs.txt")
open("C:/getentries/data/caseIDs.txt", "w") do caseID
    println(caseID, caseIDs)
end

No errors are thrown but only a few strings are written to the file. So I'm assuming something's going wrong as I try to collect all the strings.

I thought I could try the approach suggested in my last question but this didn't help - although that's likely due to my complete inexperience!

May I ask if anyone has any thoughts?



1 Reply

It's hard to say without a minimal, reproducible example. But my guess is that, since you're calling match once per file, you're only getting the first match in each file. Instead, you could call eachmatch to get an iterator over all matches in the file contents.
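To make the difference concrete, here is a small sketch on an in-memory string. The sample text is made up for illustration; the pattern is the patID from the question.

```julia
# Hypothetical sample text standing in for the contents of one file
sample = """<a href="/entry/101/go">a</a> <a href="/entry/202/go">b</a> <a href="/entry/303/go">c</a>"""
patID = r"/entry/[0-9]+/go"

# `match` returns only the first hit (or `nothing` when there is none)
first_hit = match(patID, sample)

# `eachmatch` iterates over every non-overlapping hit;
# `m.match` is the matched substring of each one
all_hits = [m.match for m in eachmatch(patID, sample)]

println(first_hit.match)     # the first ID only
println(length(all_hits))    # number of IDs found in the whole string
```

With one call to match per file you can never collect more than one string per file, which would explain why only a few strings end up in the output.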

Applied to your loop over the files, this would look something like the following:

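dir = "C:/getentries/data/"
pattern = r"/entry/[0-9]+/go"   # patID from the question
matches = String[]

for filename in filenames
    # Note that you forgot to close the file in your original example.
    # Higher-level functions such as this method of `read` open and close
    # the file for you. `readdir` returns bare names, hence the `joinpath`.
    str = read(joinpath(dir, filename), String)

    # Loop over all matches of the regexp found in the string
    for m in eachmatch(pattern, str)
        push!(matches, m.match)   # `m.match` is the matched substring
    end
end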
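Putting the pieces together, a minimal end-to-end sketch might look like the one below. The helper names collect_matches and write_lines are made up for this example, the demo files live in a temporary directory rather than your real data folder, and the filename pattern is a slightly tightened variant of your patFilename (escaped dot, anchored at both ends), which is my assumption about what you intended. It also writes one ID per line, because println(io, some_array) writes the printed form of the whole array on a single line, which would be awkward to parse later.

```julia
# Collect every match of `pattern` from the files in `dir`
# whose names match `name_pattern`.
function collect_matches(dir, name_pattern, pattern)
    found = String[]
    for filename in readdir(dir)
        # Skip files that are not data files (including the output file itself)
        occursin(name_pattern, filename) || continue
        text = read(joinpath(dir, filename), String)
        for m in eachmatch(pattern, text)
            push!(found, m.match)
        end
    end
    return found
end

# Write one entry per line; `open ... do` closes the file for us.
function write_lines(path, lines)
    open(path, "w") do io
        for line in lines
            println(io, line)
        end
    end
end

# Demo with made-up files in a throwaway directory
dir = mktempdir()
write(joinpath(dir, "1_1.json"), "x /entry/1/go y /entry/2/go z")
write(joinpath(dir, "1_2.json"), "/entry/3/go")
write(joinpath(dir, "notes.txt"), "/entry/999/go")   # ignored: name does not match

ids = collect_matches(dir, r"^[0-9]+_[0-9]+\.json$", r"/entry/[0-9]+/go")
out = joinpath(dir, "caseIDs.txt")
write_lines(out, ids)
```

Note that touch is not needed before open(path, "w"), since opening in write mode creates the file if it does not exist.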
