Avoid misuse of ctype(3) functions

The ctype(3) character classification and mapping functions have a
peculiarly limited definition (C11, Sec. 7.4 `Character handling
<ctype.h>', p. 200):

	`The header <ctype.h> declares several functions useful for
	 classifying and mapping characters.  In all cases the
	 argument is an int, the value of which shall be
	 representable as an unsigned char or shall equal the value
	 of the macro EOF.  If the argument has any other value, the
	 behavior is undefined.'

In other words, in the most common case of 8-bit char and EOF = -1,
the domain of the 257 allowed arguments is:

	-1, 0, 1, 2, ..., 254, 255

The ctype(3) functions are designed for use with stdio functions like
getchar and fgetc which return int values in the same domain.

In an ABI where char is signed (e.g., x86 SysV ABI used by most
Unixish operating systems), passing an argument of type char as is
can go wrong in two ways:

1. The value of a non-EOF input octet interpreted as `char' may
   coincide, as an integer, with the value of EOF, leading to wrong
   answers for some non-EOF inputs.

   E.g., if EOF = 1, and an input octet has all bits set, i.e., 255
   as an unsigned char, then as a char the value is -1, which will be
   confused with EOF.  In the ISO-8859-1 locale, the code point 255
   is (in Unicode terminology) LATIN SMALL LETTER Y WITH DIAERESIS,
   for which isprint, isalpha, &c., are true.  But isprint, isalpha,
   &c., are false for EOF.  So if char *s points to a string with
   that character, isprint(*s) will return false when it should
   return true.

2. Passing a negative char whose value does not coincide with EOF is
   undefined behaviour.

   This isn't purely theoretical: often the functions are implemented
   by an array lookup, #define isprint(c) (ctypetab[c] & ISPRINT).
   If c is out of range (e.g., 192, ISO-8859-1 for LATIN CAPITAL
   LETTER A WITH GRAVE, which convers to (signed) char as -64), then
   you can get garbage answers by reading uninitialized memory or
   application crashes with SIGSEGV if the page preceding the table
   is unmapped.

If what you have is an arbitrary char (e.g., from a char * string
pointing at user input), then the only correct way to use the
ctype(3) functions is by converting to unsigned char first -- e.g.,
isprint((unsigned char)*s).  (If the functions were defined as macros
that convert to unsigned char first, they would then spuriously
interpret EOF as a non-EOF, so they can't do that themselves.)

It is possible, in some cases, to prove that the input always
actually lies in {0, 1, 2, ..., 127}, so the conversion to unsigned
char is not necessary.  I didn't check whether this was the case --
it's safer to just adopt the habit of always casting char to unsigned
char first before using the ctype(3) macros, which satisfies a
compiler warning on some systems designed to detect this class of
application errors at compile-time.
This commit is contained in:
Taylor R Campbell 2022-04-05 12:51:33 +00:00
parent d2b44eccb7
commit b1ccd52183
3 changed files with 6 additions and 6 deletions

View file

@ -290,11 +290,11 @@ read_excludes (cairo_perf_t *perf,
/* whitespace delimits */ /* whitespace delimits */
s = line; s = line;
while (*s != '\0' && isspace (*s)) while (*s != '\0' && isspace ((unsigned char)*s))
s++; s++;
t = s; t = s;
while (*t != '\0' && ! isspace (*t)) while (*t != '\0' && ! isspace ((unsigned char)*t))
t++; t++;
if (s != t) { if (s != t) {

View file

@ -407,11 +407,11 @@ read_excludes (cairo_perf_t *perf,
/* whitespace delimits */ /* whitespace delimits */
s = line; s = line;
while (*s != '\0' && isspace (*s)) while (*s != '\0' && isspace ((unsigned char)*s))
s++; s++;
t = s; t = s;
while (*t != '\0' && ! isspace (*t)) while (*t != '\0' && ! isspace ((unsigned char)*t))
t++; t++;
if (s != t) { if (s != t) {

View file

@ -1515,11 +1515,11 @@ read_excludes (test_trace_t *test, const char *filename)
/* whitespace delimits */ /* whitespace delimits */
s = line; s = line;
while (*s != '\0' && isspace (*s)) while (*s != '\0' && isspace ((unsigned char)*s))
s++; s++;
t = s; t = s;
while (*t != '\0' && ! isspace (*t)) while (*t != '\0' && ! isspace ((unsigned char)*t))
t++; t++;
if (s != t) { if (s != t) {