Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

stdio - How to read / parse input in C? The FAQ

I have problems with my C program when I try to read / parse input.

Help?


This is a FAQ entry.

StackOverflow has many questions related to reading input in C, with answers usually focussed on the specific problem of that particular user without really painting the whole picture.

This is an attempt to cover a number of common mistakes comprehensively, so this specific family of questions can be answered simply by marking them as duplicates of this one:

  • Why does the last line print twice?
  • Why does my scanf("%d", ...) / scanf("%c", ...) fail?
  • Why does gets() crash?
  • ...

The answer is marked as community wiki. Feel free to improve and (cautiously) extend.

The Beginner's C Input Primer

  • Text mode vs. Binary mode
  • Check fopen() for failure
  • Pitfalls
    • Check any functions you call for success
    • EOF, or "why does the last line print twice"
    • Do not use gets(), ever
    • Do not use fflush() on stdin or any other stream open for reading, ever
    • Do not use *scanf() for potentially malformed input
    • When *scanf() does not work as expected
  • Read, then parse
    • Read (part of) a line of input via fgets()
    • Parse the line in-memory
  • Clean Up

Text mode vs. Binary mode

A "binary mode" stream is read in exactly as it has been written. However, there might (or might not) be an implementation-defined number of null characters ('\0') appended at the end of the stream.

A "text mode" stream may do a number of transformations, including (but not limited to):

  • removal of spaces immediately before a line-end;
  • changing newlines ('\n') to something else on output (e.g. "\r\n" on Windows) and back to '\n' on input;
  • adding, altering, or deleting characters that are neither printing characters (isprint(c) is true), horizontal tabs, or new-lines.

It should be obvious that text and binary mode do not mix. Open text files in text mode, and binary files in binary mode.
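As a rough illustration, the mode string passed to fopen() is what selects the stream type: "r" asks for text mode, "rb" for binary mode. The helper name count_chars below is hypothetical and not part of the original answer. On POSIX systems both modes behave the same; on Windows, a file with "\r\n" line endings would be expected to yield a lower count in text mode.

```c
#include <stdio.h>

/* Hypothetical helper: count the characters a stream delivers when the
 * file at `path` is opened with the given fopen() mode ("r", "rb", ...).
 * Returns -1 if the file cannot be opened or a read error occurs. */
long count_chars(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    if (!fp)
        return -1;

    long n = 0;
    while (fgetc(fp) != EOF)
        n++;

    /* fgetc() returns EOF for both end-of-file and errors. */
    int failed = ferror(fp);
    fclose(fp);
    return failed ? -1 : n;
}
```

Calling count_chars(name, "r") and count_chars(name, "rb") on the same file is one way to see whether your platform translates line endings.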

Check fopen() for failure

The attempt to open a file may fail for various reasons -- lack of permissions, or file not found being the most common ones. In this case, fopen() will return a NULL pointer. Always check whether fopen returned a NULL pointer, before attempting to read or write to the file.

When fopen fails, it usually sets the global errno variable to indicate why it failed. (This is technically not a requirement of the C language, but both POSIX and Windows guarantee to do it.) errno is a code number which can be compared against constants in errno.h, but in simple programs, usually all you need to do is turn it into an error message and print that, using perror() or strerror(). The error message should also include the filename you passed to fopen; if you don't do that, you will be very confused when the problem is that the filename isn't what you thought it was.

#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    FILE *fp = fopen(argv[1], "r");
    if (!fp) {
        // alternatively, just `perror(argv[1])`
        fprintf(stderr, "cannot open %s: %s\n", argv[1], strerror(errno));
        return 1;
    }

    // read from fp here

    fclose(fp);
    return 0;
}

Pitfalls

Check any functions you call for success

This should be obvious. But do check the documentation of every function you call for its return value and error handling, and check for those conditions.

These are errors that are easy to fix when you catch the condition early, but lead to lots of head-scratching if you do not.
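A minimal sketch of what this can look like for stdio calls, assuming a hypothetical helper named copy_stream that copies one open stream to another. Each call's result is inspected, including the final fflush(), because buffered output may only report a failure when it is flushed.

```c
#include <stdio.h>

/* Hypothetical helper: copy everything from `in` to `out`.
 * Returns 0 on success, -1 if any read or write fails. */
int copy_stream(FILE *in, FILE *out)
{
    char buf[4096];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        /* fwrite() returns the number of items written; fewer means failure. */
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    }

    /* fread() returns 0 at end-of-file and on error; ferror() tells them apart. */
    if (ferror(in))
        return -1;

    /* Push buffered data out now so a write error is not silently lost. */
    if (fflush(out) == EOF)
        return -1;

    return 0;
}
```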

EOF, or "why does the last line print twice"

The function feof() returns true if EOF has been reached. A misunderstanding of what "reaching" EOF actually means makes many beginners write something like this:

// BROKEN CODE
while (!feof(fp)) {
    fgets(buffer, BUFFER_SIZE, fp);
    printf("%s", buffer);
}

This makes the last line of the input print twice, because when the last line is read (up to the final newline, the last character in the input stream), EOF is not set.

EOF only gets set when you attempt to read past the last character!

So the code above loops once more, fgets() fails to read another line, sets EOF and leaves the contents of buffer untouched, which then gets printed again.

Instead, check whether fgets failed directly:

// GOOD CODE
while (fgets(buffer, BUFFER_SIZE, fp)) {
    printf("%s", buffer);
}

Do not use gets(), ever

There is no way to use this function safely: it cannot be told how large your buffer is, so any input line longer than the buffer overflows it. Because of this, it was removed from the language in C11.
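A common replacement, sketched here under the assumption that dropping the trailing newline (as gets() did) is what you want, is a thin wrapper around fgets(). The name read_line is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line into buf (at most size - 1 characters)
 * and strip the trailing newline, if there is one.
 * Returns buf, or NULL on end-of-file or read error. */
char *read_line(char *buf, int size, FILE *fp)
{
    if (!fgets(buf, size, fp))
        return NULL;

    /* strcspn() gives the index of the first '\n', or of the
     * terminating '\0' if the line contains no newline. */
    buf[strcspn(buf, "\n")] = '\0';
    return buf;
}
```

Unlike gets(), this never writes more than size bytes into buf; an overlong line is simply split across several calls.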

Do not use fflush() on stdin or any other stream open for reading, ever

Many people expect fflush(stdin) to discard user input that has not yet been read. It does not do that. In plain ISO C, calling fflush() on an input stream has undefined behaviour. It does have well-defined behaviour in POSIX and in MSVC, but neither of those makes it discard user input that has not yet been read.

Usually, the right way to clear pending input is to read and discard characters up to and including a newline, but not beyond:

int c;
do c = getchar(); while (c != EOF && c != '\n');

Do not use *scanf() for potentially malformed input

Many tutorials teach you to use *scanf() for reading any kind of input, because it is so versatile.

But the purpose of *scanf() is really to read bulk data that can reasonably be relied upon to be in a predefined format (such as data written by another program).

Even then *scanf() can trip the unobservant:

  • Using a format string that in some way can be influenced by the user is a gaping security hole.
  • If the input does not match the expected format, *scanf() immediately stops parsing, leaving any remaining arguments uninitialized.
  • It will tell you how many assignments it has successfully done -- which is why you should check its return code (see above) -- but not where exactly it stopped parsing the input, making graceful error recovery difficult.
  • It skips any leading whitespace in the input, except when it does not ([, c, and n conversions). (See next paragraph.)
  • It has somewhat peculiar behaviour in some corner cases.
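If you do use *scanf() on trusted input, the points above suggest at least two habits: keep the format string a literal, and compare the return value against the number of conversions you asked for. A minimal sketch, using sscanf() on an in-memory string and a hypothetical helper named parse_pair:

```c
#include <stdio.h>

/* Hypothetical helper: parse two integers from `s`.
 * Returns 1 if both were converted, 0 otherwise.
 * *x and *y are only written on success. */
int parse_pair(const char *s, int *x, int *y)
{
    int a, b;

    /* sscanf() returns the number of successful assignments, or EOF
     * if the input ends before the first conversion. */
    if (sscanf(s, "%d %d", &a, &b) != 2)
        return 0;

    *x = a;
    *y = b;
    return 1;
}
```

Converting into local variables first means a partial match never leaves the caller's variables half-updated.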

When *scanf() does not work as expected

A frequent problem with *scanf() is unread whitespace (' ', '\n', ...) in the input stream that the user did not account for.

Reading a number ("%d" et al.) or a string ("%s") stops at any whitespace and leaves it in the stream. And while most *scanf() conversion specifiers skip leading whitespace in the input, [, c and n do not. So the newline is still the first pending input character: %c reads that newline instead of the character you wanted, and %[ fails to match unless its scanset includes the newline.

You can skip over the newline in the input by explicitly reading it, e.g. via fgetc(), or by adding a whitespace character to your *scanf() format string before the conversion. (A single whitespace character in the format string matches any amount of whitespace in the input, including none.)
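The difference is easiest to see side by side. The sketch below uses sscanf() on a string so it does not depend on what is typed at a terminal; both function names are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: "%d%c" with no space. The %c conversion takes
 * whatever character follows the number, whitespace included.
 * Returns 1 if both conversions succeeded, 0 otherwise. */
int parse_int_char_naive(const char *s, int *n, char *c)
{
    return sscanf(s, "%d%c", n, c) == 2;
}

/* Hypothetical helper: "%d %c" with a space. The space in the format
 * consumes any whitespace (including newlines) before %c reads.
 * Returns 1 if both conversions succeeded, 0 otherwise. */
int parse_int_char(const char *s, int *n, char *c)
{
    return sscanf(s, "%d %c", n, c) == 2;
}
```

For the input "42\nx", the first version stores the newline in *c, while the second stores 'x'.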

Read, then parse

We just advised against using *scanf() except when you really, positively, know what you are doing. So, what to use as a replacement?

Instead of reading and parsing the input in one go, as *scanf() attempts to do, separate the steps.

Read (part of) a line of input via fgets()

fgets() takes a size parameter and stores at most that many bytes in your buffer, including the terminating null character, which avoids overflowing the buffer. If the input line fit into your buffer completely, the last character in your buffer will be the newline ('\n'), unless the input ended without a final newline. If it did not all fit, you are looking at a partially-read line.
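One possible way to act on that check is sketched below. The helper name read_whole_line and its policy of discarding the rest of an overlong line are assumptions made for illustration; a real program might prefer to grow the buffer instead.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line into buf.
 * Returns  1 if the whole line is in buf (newline stripped),
 *          0 if the line was too long (the rest of it is discarded),
 *         -1 on end-of-file or read error with nothing read. */
int read_whole_line(char *buf, int size, FILE *fp)
{
    if (!fgets(buf, size, fp))
        return -1;

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';
        return 1;
    }

    /* No newline in buf: look at the next character to find out why. */
    int c = fgetc(fp);
    if (c == EOF || c == '\n')
        return 1;   /* input ended, or only the newline did not fit */

    /* Partially-read line: skip up to and including its newline. */
    while (c != '\n' && c != EOF)
        c = fgetc(fp);
    return 0;
}
```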

Parse the line in-memory

Especially useful for in-memory parsing are the strtol() and strtod() function families, which provide similar functionality to the *scanf() conversion specifiers d, i, u, o, x, a, e, f, and g.

But they also tell you exactly where they stopped parsing, and have meaningful handling of numbers too large for the target type.

Beyond those, C offers a wide range of string processing functions. Since you have the input in memory, and always know exactly how far you have parsed it already, you can walk back as many times as you like trying to make sense of the input.

And if all else fails, you have the whole line available to print a helpful error message for the user.
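As a sketch of the strtol() approach, assuming you want to accept a line only if it consists of a single base-10 integer (optionally followed by whitespace such as the newline fgets() leaves behind). The helper name parse_long is hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: convert the whole string `s` to a long.
 * Returns 1 on success; 0 if `s` holds no number, has trailing
 * garbage, or the value is out of range for long. */
int parse_long(const char *s, long *out)
{
    char *end;

    errno = 0;                          /* strtol() only sets errno on failure */
    long value = strtol(s, &end, 10);

    if (end == s)                       /* no digits were consumed */
        return 0;
    if (errno == ERANGE)                /* value does not fit in a long */
        return 0;

    /* Allow trailing whitespace, e.g. the newline kept by fgets(). */
    while (*end == ' ' || *end == '\t' || *end == '\n')
        end++;
    if (*end != '\0')                   /* something else follows the number */
        return 0;

    *out = value;
    return 1;
}
```

Because `end` points at the first unparsed character, a rejected line can be reported back to the user with the exact position of the problem.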

Clean Up

