Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others



Can I use regex to look for strings in a big file that meet the following conditions:

EDITED FOR CLARIFICATION & SPECIFICITY

I know this is a tough one, but I thought I'd ask anyway...

I'm using grep or egrep (i.e. "grep -E", with extended regex capability). I was also told that the strings utility could be used and might help with this effort, but I haven't fully explored that option yet...

Input file: a binary file, so it contains all kinds of junk.

Desired Output: strings that meet all of these conditions:

  1. Return ONLY strings of 8-24 readable characters, excluding white space " ", since spaces are the delimiters (separators) of strings in the input file.

  2. ONLY the following characters can make up a string, and they are allowed ANYWHERE (beginning, middle, end) in a string:

"0-9" "a-z" "A-Z" ! # $ % ^ & ( ) @ ~ " ' ] ? [ * + ; , =

  3. The following characters are NOT allowed in a string:

/ . | : < > with one exception: the dot '.' may appear ONLY at the beginning or at the end of a string, NOT in the middle. For now I have removed the dot from the regex completely, because I don't know the syntax for specifying that it can only be at the beginning or end of a string, and if I include the dot in the regex, it returns tons of false strings (junk).

  4. No string should contain 3 or more repeated back-to-back characters, i.e. strings that have 3 or more identical characters in a row should be ignored.

e.g. aaab^s zY&$$$$[[[[[[777th, or ((((%%_+++------ should be ignored.

  5. All non-readable characters should be ignored; they are not acceptable in a string.

e.g. subscripts 1q n× ÷ ± D à ?? ? è á ? ù ? ? ò etc...

I've tested some of your suggestions and so far, this regex does about 90% of the job.

(?!(.)\1{3})[0-9a-zA-Z!#$%^&()@~"'*-+\]\[;,=]{8,24}

but only when tested on www.rubular.com or www.gethifi.com/tools/regex. For some reason, grep is choking on it!!!

For your reference, I'm including a sample of the binary file in question:

Sample:

http://pastebin.com/wY6a0Uir

Note: if you test the sample on http://www.gethifi.com/tools/regex, you'll see that line #21 of the results, for example, should not have been returned.

Hope this clarifies the question a bit rather than confusing it further :)

Cheers!



1 Reply


If your regex engine supports it, you can use a zero-width negative lookahead assertion with a back reference. Add this to the beginning of your regex:

(?!(.)\1{3})

So the full regex looks like this:

(?!(.)\1{3})[0-9a-zA-Z!@#$%^()+_{}]{6,24}

Or this:

(?!(.)\1{3})[!--/-~]{6,24}
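
A small Python sketch of how the lookahead behaves, in case it helps. Two things here are assumptions of the sketch, not part of the regexes above: the `each` variant, which repeats the lookahead before every character, and the use of `\1{2}` so that three identical characters in a row are rejected (`\1{3}` only rejects four). Lookaheads and backreferences inside them are a Perl-style feature that POSIX extended regex does not have, which is a likely reason `grep -E` rejects the pattern; a GNU grep built with PCRE support accepts this syntax under `-P`. The character class is the one from the second regex, written with an escaped hyphen because Python's `re` may warn about an unescaped `--` inside a set.

```python
import re

# Same set as [!--/-~]: printable ASCII from '!' to '~' with '.' left out.
CLASS = r"[!-\-/-~]"

# Lookahead placed once at the front: it is tested only where a match starts.
front = re.compile(r"(?!(.)\1{3})" + CLASS + r"{6,24}")

# Assumption: the lookahead is repeated before every character, so a run is
# refused wherever it begins. \1{2} means 3 in a row (1 char + 2 repeats).
each = re.compile(r"(?:(?!(.)\1{2})" + CLASS + r"){6,24}")

sample = "abc$$$$def"

# The run of '$' starts after the match start, so the front version misses it.
front_hit = front.search(sample).group(0)

# The per-character version finds no 6+ character stretch free of such runs.
each_hit = each.search(sample)

print(front_hit)
print(each_hit)
```

Note that `each` can still match the tail of a token once a run has ended (for example the part after the last `$$` if it is long enough), so on its own it does not throw away the whole token.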

Test it at:

http://rubular.com/r/RbYIXR4a16
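
If a single regex gets unwieldy, the conditions from the question can also be checked one at a time in a few lines of Python. This is only a sketch, and it rests on assumptions the question does not settle: tokens are delimited by ASCII whitespace only, a token containing any other byte is dropped whole, a leading or trailing dot counts toward the 8-24 length, and the character set follows the asker's regex (so the `?` shown in the list is left out). `candidate_strings` and the sample bytes are made up for the example.

```python
import re

# Allowed characters, following the asker's regex (condition 2).
ALLOWED = rb"""0-9a-zA-Z!#$%^&()@~"'\]\[*+;,="""

# Optional dot at either end only, never in the middle (condition 3).
TOKEN = re.compile(rb"\.?[" + ALLOWED + rb"]+\.?")

# Three identical bytes back to back (condition 4).
RUN = re.compile(rb"(.)\1\1")


def candidate_strings(data: bytes) -> list:
    """Return whitespace-delimited tokens that pass every check."""
    found = []
    for token in data.split():          # no-arg split: ASCII whitespace
        if not 8 <= len(token) <= 24:   # condition 1 (assumed to count dots)
            continue
        if not TOKEN.fullmatch(token):  # conditions 2, 3 and 5
            continue
        if RUN.search(token):           # condition 4
            continue
        found.append(token.decode("ascii"))
    return found


# Hypothetical input standing in for a chunk of the binary file.
sample = (b"goodPass1! aaab^sXYZ12 short ab/cdefghij "
          b".dotEnds1. mid.dle123 \xff\xfejunk1234")
result = candidate_strings(sample)
print(result)
```

For a real file the bytes would come from something like `open(path, "rb").read()`, where `path` is whatever the binary file is called. Splitting first and testing each token whole is what lets the run check reject the entire token instead of just the part around the run.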

