Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

regex - Regular expression to recognize variable declarations in C

I'm working on a regular expression to recognize variable declarations in C and I have got this.

[a-zA-Z_][a-zA-Z0-9]*

Is there any better solution?


1 Reply


Here is a pattern to recognize variable declarations in C. Looking at a conventional declaration, we see:

int variable;

If that's the case, one should test for the type keyword before anything else, to avoid matching something else, like a string or a constant defined with the preprocessor.

(?:\w+\s+)([a-zA-Z_][a-zA-Z0-9]+)

The variable name resides in capture group 1 (\1).

The feature you need is look-behind/look-ahead.
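
As a rough illustration of what a look-behind/look-ahead version could look like, here is a minimal Python sketch. The pattern and the helper name int_names are assumptions made for this example, not something from the answer. Python's re module only accepts fixed-width look-behind, so the sketch hard-codes the single keyword int followed by exactly one space.

```python
import re

# Hypothetical pattern for illustration only:
#   (?<=\bint )  look-behind: the name must be preceded by the word "int" and one space
#   (?=\s*;)     look-ahead: the name must be followed by optional whitespace and ";"
# Neither assertion consumes text, so the whole match is just the identifier.
INT_NAME = re.compile(r"(?<=\bint )[a-zA-Z_][a-zA-Z0-9_]*(?=\s*;)")


def int_names(line):
    """Return the identifiers declared as plain `int name;` on this line."""
    return INT_NAME.findall(line)


if __name__ == "__main__":
    print(int_names("int variable;"))
    print(int_names("goto jump;"))
```

With look-around the match itself is the variable name, so no capture group is needed. The price is that the type part has to be fixed-width in engines such as Python's.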

UPDATE July 11 2015

The previous regex fails to match variables that contain _ anywhere after the first character. It also assumes that variable names have two or more characters. To fix both problems, add _ to the second character class of the capture group and change + to *. This is how it looks after the fix:

(?:\w+\s+)([a-zA-Z_][a-zA-Z0-9_]*)
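
To see the difference between the two patterns, here is a small Python sketch. The names FIRST, FIXED and first_name are made up for this example; the patterns are the two shown above, with the backslashes written out.

```python
import re

# Original pattern: no "_" after the first character, and "+" demands a second character.
FIRST = re.compile(r"(?:\w+\s+)([a-zA-Z_][a-zA-Z0-9]+)")
# Fixed pattern: "_" allowed everywhere, and "*" accepts one-character names.
FIXED = re.compile(r"(?:\w+\s+)([a-zA-Z_][a-zA-Z0-9_]*)")


def first_name(pattern, line):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(line)
    return match.group(1) if match else None


if __name__ == "__main__":
    for line in ("int my_var;", "int x;", "goto jump;"):
        print(line, "->", first_name(FIRST, line), "/", first_name(FIXED, line))
```

In Python the first pattern stops at the underscore (it captures my for int my_var;) and finds nothing in int x;, while the fixed pattern captures my_var and x. The fixed pattern still captures jump from goto jump;.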

However, this regular expression produces many false positives, goto jump; being one of them. Frankly, it is not suitable for the job. Because of that, I decided to create another regex that covers a wider range of cases, though it is far from perfect. Here it is:

(?:(?:auto\s*|const\s*|unsigned\s*|signed\s*|register\s*|volatile\s*|static\s*|void\s*|short\s*|long\s*|char\s*|int\s*|float\s*|double\s*|_Bool\s*|complex\s*)+)(?:\s+\*?\*?\s*)([a-zA-Z_][a-zA-Z0-9_]*)\s*[\[;,=)]

I've tested this regex with Ruby, Python and JavaScript, and it works very well for the common cases; however, it fails in some cases. The regex may also need some optimization, though that is hard to do while maintaining portability across several regex engines.

Test summary

unsignedchar *var;                   /* FAIL, matches +var+ (the optional whitespace after each keyword lets unsigned and char run together) */
goto **label;                        /* OK, doesn't match */
int function();                      /* OK, doesn't match */
char **a_pointer_to_a_pointer;       /* OK, matches +a_pointer_to_a_pointer+ */
register unsigned char *variable;    /* OK, matches +variable+ */
long long factorial(int n)           /* OK, matches +n+ */
int main(int argc, int *argv[])      /* OK, matches +argc+ and +argv+ (needs two passes) */
const * char var;                    /* OK, matches +var+, however, it doesn't consider +const *+ as part of the declaration */
int i=0, j=0;                        /* 50%, matches +i+ but it will not match j after the first pass */
int (*functionPtr)(int,int);         /* FAIL, doesn't match (too complex) */
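
Several of the results above can be checked with a short Python sketch like the one below. The names KEYWORDS, DECL and declared_names are assumptions made for this example. The pattern is the long regex from this answer with its backslashes written out, assembled from a keyword list only to keep the line readable.

```python
import re

# Type keywords and qualifiers, in the same order as in the regex above.
KEYWORDS = [
    "auto", "const", "unsigned", "signed", "register", "volatile", "static",
    "void", "short", "long", "char", "int", "float", "double", "_Bool", "complex",
]

DECL = re.compile(
    r"(?:(?:" + "|".join(k + r"\s*" for k in KEYWORDS) + r")+)"  # one or more keywords
    r"(?:\s+\*?\*?\s*)"                                          # whitespace, up to two '*'
    r"([a-zA-Z_][a-zA-Z0-9_]*)"                                  # group 1: the name
    r"\s*[\[;,=)]"                                               # what may follow a name
)


def declared_names(line):
    """Return every name the pattern captures on one line of C source."""
    return DECL.findall(line)


if __name__ == "__main__":
    cases = [
        "goto **label;",
        "int function();",
        "char **a_pointer_to_a_pointer;",
        "register unsigned char *variable;",
        "long long factorial(int n)",
        "int main(int argc, int *argv[])",
        "int i=0, j=0;",
        "int (*functionPtr)(int,int);",
    ]
    for line in cases:
        print(f"{line:40} {declared_names(line)}")
```

findall scans the whole line for non-overlapping matches, so argc and argv both come back from a single call. j in int i=0, j=0; is still missed, because no type keyword precedes it.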

False positives

The following case is hard to cover with a portable regular expression. Text editors use contexts to avoid highlighting text inside quotes.

printf("int i=%d", i);               /* FAIL, matches +i+ inside quotes */
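
One possible workaround, which is not part of the original answer, is to blank out string literals in a first pass and only then apply the declaration regex. The sketch below assumes double-quoted literals with backslash escapes. It ignores comments and character literals such as '"', so treat it as a rough approximation of the contexts an editor would track. STRING_LITERAL, mask_strings and names_outside_strings are hypothetical names.

```python
import re

# Same declaration regex as in this answer, with the backslashes written out.
DECL = re.compile(
    r"(?:(?:auto\s*|const\s*|unsigned\s*|signed\s*|register\s*|volatile\s*|static\s*"
    r"|void\s*|short\s*|long\s*|char\s*|int\s*|float\s*|double\s*|_Bool\s*|complex\s*)+)"
    r"(?:\s+\*?\*?\s*)([a-zA-Z_][a-zA-Z0-9_]*)\s*[\[;,=)]"
)

# A double-quoted literal: any mix of escaped characters and characters
# that are neither a quote nor a backslash.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')


def mask_strings(source):
    """Replace the contents of every string literal with an empty literal."""
    return STRING_LITERAL.sub('""', source)


def names_outside_strings(source):
    """Apply the declaration regex after string contents have been removed."""
    return DECL.findall(mask_strings(source))


if __name__ == "__main__":
    print(names_outside_strings('printf("int i=%d", i);'))
    print(names_outside_strings('int i=0; puts("char c;");'))
```

Masking keeps the quotes in place, so the rest of the line keeps its shape and only the text inside the literal disappears.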

False positives (syntax errors)

This can be fixed if one tests the syntax of the source file before applying the regular expression. With GCC and Clang one can just pass the -fsyntax-only flag to test the syntax of a source file without compiling it:

int char variable;                  /* matches +variable+ */
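
A minimal sketch of running that syntax check from Python before applying the regex. The helper names syntax_check_command and has_valid_syntax are made up for this example, and the compiler is assumed to be on PATH under the name gcc (pass "clang" to use Clang instead).

```python
import shutil
import subprocess


def syntax_check_command(path, compiler="gcc"):
    """Build the command line that checks syntax without producing output files."""
    return [compiler, "-fsyntax-only", path]


def has_valid_syntax(path, compiler="gcc"):
    """Return True/False for the compiler's verdict, or None if it is not installed."""
    if shutil.which(compiler) is None:
        return None
    result = subprocess.run(
        syntax_check_command(path, compiler),
        capture_output=True,
        text=True,
    )
    # The compiler exits with a non-zero status when it reports an error.
    return result.returncode == 0


if __name__ == "__main__":
    # "example.c" is a placeholder file name.
    print(has_valid_syntax("example.c"))
```

The regular expression would then be applied only to files for which has_valid_syntax returns True.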
