The C programming language is imperative, general-purpose and low-level. Invented in the early 1970s to rewrite Unix, C remains one of the most widely used languages to this day. Many more recent languages, such as C++, C#, Java, PHP and JavaScript, have adopted a syntax similar to that of C and borrowed part of its logic. C gives the developer considerable control over the machine (especially over memory management) and is therefore used to build the “foundations” (compilers, interpreters …) of these more recent languages.
C (programming language) | |
---|---|
Date of the first release | 1972 |
Paradigm | Imperative, procedural, structured |
Author | Dennis Ritchie, Brian Kernighan |
Developer | Dennis Ritchie and Kenneth Thompson, Bell Labs |
Typing | Static, weak |
Standards | ANSI X3.159-1989 (ANSI C, C89) ISO/IEC 9899:1990 (C90) ISO/IEC 9899:1990/AMD1:1995 (C95) ISO/IEC 9899:1999 (C99) ISO/IEC 9899:2011 (C11) ISO/IEC 9899:2018 (C18) |
Influenced by | BCPL, B, Algol 68, Fortran |
Influenced | awk, csh, C++, C#, Objective-C, D, Concurrent C, Java, JavaScript, PHP, Perl |
Implementations | GCC, MSVC, Borland C, Clang, TCC |
File extensions | .c, .h |
History of C

The C language was invented in 1972 at Bell Laboratories. It was developed at the same time as Unix by Dennis Ritchie and Ken Thompson. Thompson had developed C’s direct predecessor, the B language, which was itself largely inspired by BCPL. Ritchie evolved B into a new version that differed enough, notably through the addition of types, to be given a new name: C.
Although C derives directly from B, Ritchie also acknowledged influences from PL/I, FORTRAN and ALGOL 68. In addition, Ritchie reported that the team was convinced of the merits of writing an operating system in a language of a higher level than assembly language, a pioneering aspect of Multics, which was written in PL/I.
Subsequently, Brian Kernighan helped popularize the C language. He also made some last-minute changes. In 1978, Kernighan was the lead author of the book The C Programming Language, which described the language once it had finally stabilized; Ritchie took care of the appendices and the examples using Unix. This book is also known as “the K&R”, and the language as it existed at that time is referred to as traditional C or K&R C.
Standardization
In 1983, the American National Standards Institute (ANSI) formed a language standardization committee (X3J11), whose work resulted in 1989 in the so-called ANSI C or C89 standard (formally ANSI X3.159-1989). In 1990, this standard was also adopted by the International Organization for Standardization (C90, ISO C, formally ISO/IEC 9899:1990). ANSI C is an evolution of K&R C that remains highly compatible with it. It takes up some ideas from C++, including function prototypes and type qualifiers.
Between 1994 and 1996, the ISO working group (ISO/IEC JTC1/SC22/WG14) published two technical corrigenda and one amendment to C90: ISO/IEC 9899/COR1:1994 Technical Corrigendum 1, ISO/IEC 9899/AMD1:1995 C Integrity and ISO/IEC 9899/COR2:1996 Technical Corrigendum 2. These rather modest changes are sometimes referred to as C89 with amendment 1, or C94/C95. Three header files were added: two for wide characters and one defining a number of macros related to the ISO 646 character standard.
In 1999, a new evolution of the language was standardized by ISO: C99 (formally ISO/IEC 9899:1999). New features include variable-length arrays, restrict-qualified pointers, complex numbers, compound literals, declarations mixed with statements, inline functions, improved floating-point support, and C++-style comments. The C standard library was expanded with six header files relative to the previous standard.
In 2011, ISO ratified a new version of the standard: C11, formally ISO/IEC 9899:2011. This evolution introduced support for multi-threaded programming, generic expressions, and better Unicode support.
In 2018, ISO ratified a new version: formally ISO/IEC 9899:2018, also known as C18 or C17. This revision clarifies and corrects points of contention and introduces no new features.
A future release, codenamed C23 and C2x, is under development.
General characteristics
It is a procedural and general-purpose programming language. It is referred to as a low-level language in the sense that each instruction in the language is designed to be compiled into a fairly predictable number of machine instructions in terms of memory occupancy and computational load. In addition, it offers a range of integer and floating types designed to correspond directly to the data types supported by the processor. Finally, it makes intensive use of memory address calculations with the notion of pointer.
Apart from the basic types, C supports enumerated, compound, and opaque types. On the other hand, it provides no operation that deals directly with higher-level objects (computer file, string, list, hash table, etc.). These more advanced types must be handled by manipulating pointers and compound types. Similarly, the language offers neither built-in support for object-oriented programming nor an exception handling system. There are standard functions to manage input-output and strings, but, unlike in other languages, no dedicated operators to make them more convenient. This makes it easy to replace the standard functions with functions specifically designed for a given program.
These characteristics make it a preferred language when trying to control the hardware resources used, the machine language and binary data generated by compilers being relatively predictable. This language is therefore widely used in areas such as embedded programming on microcontrollers, intensive computing, writing operating systems and modules where processing speed is important. It is a good alternative to assembly language in these areas, with the advantages of a more expressive syntax and portability of the source code. The C language was invented to write the Unix operating system, and is still used for system programming. Thus the kernels of major operating systems such as Windows and Linux are developed largely in C.
On the other hand, the development of C programs, especially if they use complex data structures, is more difficult than with higher-level languages. Indeed, for the sake of performance, the C language requires the user to program certain processes (memory freeing, checking the validity of indices on arrays, etc.) which are automatically supported in high-level languages.
Stripped of the conveniences provided by its standard library, C is a simple language, and so is its compiler. This is reflected in the development time of a C compiler for a new processor architecture: Kernighan and Ritchie estimated that it could be developed in two months because “we will find that 80% of the code of a new compiler is identical to that of the codes of other compilers already existing.”
Qualities and defects
It is one of the most widely used languages because:
- it has been around for a long time, since the early 1970s;
- it is based on an open standard;
- many computer scientists know it;
- it allows memory allocation to be minimized and fully controlled, and performance to be maximized, in particular through the use of pointers;
- it allows the construction of complex, ad hoc data structures, as close as possible to needs;
- compilers and software libraries exist for most architectures;
- it has influenced many more recent languages including C++, Java, C# and PHP; its syntax in particular is widely used;
- it implements a limited number of concepts, which makes it easier to master and to write simple and fast compilers for;
- it does not rigidly specify the behavior of the executable file produced, which helps to take advantage of the capabilities of each computer;
- by compiling directly to machine language (via the assembler), it allows the writing of software that needs no runtime support (neither software library nor virtual machine), with predictable behavior in both execution time and RAM consumption, such as operating system kernels and embedded software.
Its main disadvantages are:
- the few checks offered by the original compilers (K&R C), and the absence of checks at runtime, which meant that errors that could have been detected automatically during development were detected only at runtime, at the cost of a software crash:
  - on UNIX, the lint and cflow utilities could be used to avoid such mishaps,
  - checks have been added over time, but they remain partial;
- its approach to modularity has remained at the level of what was done in the early 1970s, and has since been largely surpassed by other languages:
  - it does not facilitate object-oriented programming,
  - it does not allow namespaces to be created;
- very basic exception handling;
- very limited support for genericity, despite the introduction of generic expressions in C11;
- the intricacies of writing portable programs, because the exact behavior of executables depends on the target computer;
- minimalist support for memory allocation and strings, forcing programmers to deal with tedious and bug-prone details; in particular, there is no standard garbage collector;
- serious bugs that can be caused by a simple lack of attention from the developer, such as buffer overflows, which are security vulnerabilities exploitable by malware;
- some errors can only be detected automatically using additional, non-standardized tools, such as lint and Valgrind;
- the low productivity of the language compared to newer languages.
Syntax overview
Hello world
The Hello World program was proposed as an example in 1978 in The C Programming Language by Brian Kernighan and Dennis Ritchie. Creating a program displaying “hello world” has since become the reference example for presenting the basics of a new language. Here is the original example from the 1st edition of 1978:
main()
{
printf("hello, world\n");
}
- main is the name of the main function, also known as the program entry point.
- The parentheses () after main indicate that this is a function.
- Braces { and } surround the statements that make up the body of the function.
- printf is a standard output function, which writes to the default console.
- Characters " delimit a string; hello, world\n in this case.
- The \n characters are an escape sequence representing the line break.
- A semicolon ; ends the expression statement.
Evolution of practices
The same program, compliant with the ISO standard and following contemporary good practices:
#include <stdio.h>

int main(void)
{
    printf("hello, world\n");
    return 0;
}
- #include <stdio.h> includes the standard header <stdio.h>, which contains the declarations of the I/O functions of the standard C library, including the printf function used here.
- int is the type returned by the function main. The int type is the implicit type in K&R C and C89, and it was commonly omitted when the Kernighan and Ritchie example was written. It is mandatory in C99.
- The keyword void in parentheses means that the function has no parameters. It can be unambiguously omitted when defining a function. On the other hand, if it is omitted when declaring the function, it means that the function can receive any parameters. This feature of the declaration is considered obsolete in the C 2011 standard. It can be noted that in the MISRA C 2004 standard, which imposes restrictions on C89 for uses requiring greater security, the keyword void is mandatory for the declaration as well as for the definition of a function without arguments.
- The statement return 0; indicates that the function main returns the value 0. This value is of type int, and corresponds to the int in front of main.
Brevity of syntax
The syntax of C was designed to be brief. Historically, it has often been opposed to that of Pascal, an imperative language also created in the 1970s. Here is an example with a factorial function:
/* In C (ISO norm) */
int factorial(int n)
{
    if (n > 0)
        return n * factorial(n - 1);
    else
        return 1;
}
{ In Pascal }
function factorial(n: integer): integer;
begin
  if n > 0 then
    factorial := n * factorial(n - 1)
  else
    factorial := 1
end;
Where Pascal uses the keywords function, integer, begin, if, then, else, and end, C uses only int, if, else, and return, with the other keywords replaced by parentheses and braces.
Expression language
The brevity of C is not based solely on syntax. The large number of operators available, the fact that most statements contain an expression, that expressions almost always produce a value, and that test statements simply compare the value of the expression being tested with zero, contribute to the brevity of the source code.
Here is the example of a string copy function — the principle of which is to copy the characters until you have copied the null character, which by convention marks the end of a C string — given in The C Programming Language, 2nd edition, p. 106:
void strcpy(char *s, char *t)
{
    while (*s++ = *t++)
        ;
}
The while loop uses a classic C writing style, which has helped give the language a reputation for being unreadable. The expression *s++ = *t++ contains two pointer dereferences, two pointer increments, and an assignment; the assigned value is then compared with zero by the while. This loop has no body, because all operations are performed in the test expression of the while. Mastering this type of notation is considered part of mastering C.
For comparison, a version that does not use shortcut operators or implicit comparison to zero would give:
void strcpy(char *s, char *t)
{
    while (*t != '\0') {
        *s = *t;
        s = s + 1;
        t = t + 1;
    }
    *s = *t;
}
From sources to executables
Sources
A program written in C is usually divided into several source files compiled separately.
C source files are text files, usually in the character encoding of the host system. They can be written with a simple text editor. There are many editors, even integrated development environments (IDEs), that have specific functions to support writing sources in C.
The practice is to give the filename extensions .c and .h to C source files. .h files are called header files. They are designed to be included at the beginning of source files, and contain only declarations.
When a .c or .h file uses an identifier declared in another .h file, it includes the latter. The principle generally applied is to write a .h file for each .c file, and to declare in the .h file everything that is exported by the .c file.
The generation of an executable from the source files is done in several steps, which are often automated using tools such as make, SCons, or tools specific to an integrated development environment. There are four steps leading from sources to the executable file: preprocessing (also called precompilation), compilation, assembly, and linking. When a project is compiled, only .c files are part of the list of files to compile; .h files are included by the preprocessor directives contained in the source files.
Precompilation
The C preprocessor executes directives contained in the source files. It recognizes them by the fact that they are at the beginning of a line and all start with the hash character #. Some of the most common directives include:
- #include for inclusion;
- #define for macro definition;
- #if to start conditional compilation;
- #ifdef and #ifndef, equivalent to #if defined and #if !defined;
- #endif to close the conditional compilation.
In addition to executing directives, the preprocessor replaces comments with white space and expands macros. For the rest, the source code is transmitted as is to the compiler for the next phase. However, each #include in the source code must be recursively replaced by the included source code. Thus, the compiler receives a single source from the preprocessor, which is the compilation unit.
The following is an example of a copyarray.h source file that makes typical use of preprocessor directives:
#ifndef COPYARRAY_H
#define COPYARRAY_H

#include <stddef.h>

void copyArray(int *, size_t);

#endif
The #ifndef, #define, and #endif directives ensure that the code inside is compiled only once even if it is included multiple times. The #include <stddef.h> directive includes the header that declares the size_t type used in the prototype of copyArray.
Compilation
The compilation phase usually consists of generating assembly code. This is the most processing-intensive phase. It is carried out by the compiler proper. For each compilation unit, a file in assembly language is obtained.
This step can be divided into sub-steps:
- lexical analysis, which recognizes the tokens of the language (keywords, identifiers, operators, constants);
- parsing, which analyzes the structure of the program and its compliance with the standard;
- code optimization;
- the writing of code isomorphic to assembly code (and sometimes the assembly code itself, when requested through a compiler option).
Loosely speaking, the entire process of generating an executable file from the source files is often called compilation. Strictly, however, compilation is only one of the steps leading to the creation of an executable.
Some C compilers operate at this level in two phases, the first generating a file compiled into an intermediate language for an ideal virtual machine (see Bytecode or P-Code) portable from one platform to another, the second converting the intermediate language into an assembly language dependent on the target platform. Other C compilers allow you not to generate an assembly language, but only the file compiled in the intermediate language, which will be interpreted or compiled automatically into native code at runtime on the target machine (by a virtual machine that will be linked to the final program).
Assembly
This step consists of generating a machine language object file for each assembly code file. Object files usually have the extension .o on Unix, and .obj with development tools for MS-DOS, Microsoft Windows, VMS, CP/M… This phase is sometimes grouped with the previous one by establishing an internal data flow without going through intermediate language or assembly language files. In this case, the compiler directly generates an object file.
For compilers that generate intermediate code, this assembly phase can also be completely eliminated: it is a virtual machine that will interpret or compile this language into native machine code. The virtual machine can be a component of the operating system or a shared library.
Linking in C
Linking is the last step, and aims to bring together all the elements of a program. The different object files are combined, along with the static libraries, to produce a single executable file.
The purpose of linking is to select the useful code elements present in a set of compiled code and libraries, and to resolve the mutual references between these elements so that they can refer to one another directly when the program runs. Linking fails if referenced code elements are missing.
Elements of language in C
Lexical elements
The ASCII character set is sufficient to write in C. It is even possible, but unusual, to restrict oneself to the invariant character set of ISO 646, using escape sequences called trigraphs. Typically, C sources are written with the character set of the host system. However, the execution character set may differ from that of the source.
C is case sensitive. Whitespace characters (space, tab, end of line) can be freely used for layout, as they are equivalent to a single space in most cases.
Keywords
C89 has 32 keywords, five of which did not exist in K&R C (marked C89 below). In alphabetical order:
auto, break, case, char, const (C89), continue, default, do, double, else, enum (C89), extern, float, for, goto, if, int, long, register, return, short, signed (C89), sizeof, static, struct, switch, typedef, union, unsigned, void (C89), volatile (C89), while.
These are reserved terms that should not be used otherwise.
The C99 revision adds five:
_Bool, _Complex, _Imaginary, inline, restrict.
Those of the new keywords that begin with an underscore followed by a capital letter are spelled that way to maximize compatibility with existing code. Standard library headers provide the aliases bool (<stdbool.h>), complex, and imaginary (<complex.h>).
The C11 revision introduces seven new keywords with the same conventions:
_Alignas, _Alignof, _Atomic, _Generic, _Noreturn, _Static_assert, _Thread_local.
The standard headers <stdalign.h>, <stdnoreturn.h>, <assert.h>, and <threads.h> provide the aliases alignas and alignof, noreturn, static_assert, and thread_local, respectively.
Preprocessor instructions
The C language preprocessor provides the following directives:
#include, #define, #pragma (C89), #if, #ifdef, #ifndef, #elif (C89), #else, #endif, #undef, #line, #error.
Types
The C language includes many integer types, occupying more or fewer bits. The size of the types is only partially standardized: the standard only sets a minimum size and a minimum magnitude. The minimum magnitudes are compatible with binary representations other than two's complement, although this representation is almost always used in practice. This flexibility allows the language to be efficiently adapted to a wide variety of processors, but it complicates the portability of programs written in C.
Each integer type has a “signed” form that can represent both negative and positive numbers, and an “unsigned” form that can represent only natural numbers. The signed and unsigned forms must be the same size.
The most common type is int, which represents the machine word.
Unlike in many other languages, the char type is an integer type like any other, although it is generally used to represent characters. Its size is by definition one byte.
Integer types, in ascending order | ||
---|---|---|
Type | Minimum representable range | Minimum size required by the standard |
char | same as signed char or unsigned char, depending on the implementation | 8 bits |
signed char | -127 to 127 | 8 bits |
unsigned char (C89) | 0 to 255 | 8 bits |
short, signed short | -32,767 to 32,767 | 16 bits |
unsigned short | 0 to 65,535 | 16 bits |
int, signed int | -32,767 to 32,767 | 16 bits |
unsigned int | 0 to 65,535 | 16 bits |
long, signed long | -2,147,483,647 to 2,147,483,647 | 32 bits |
unsigned long | 0 to 4,294,967,295 | 32 bits |
long long (C99), signed long long (C99) | -9,223,372,036,854,775,807 to 9,223,372,036,854,775,807 | 64 bits |
unsigned long long (C99) | 0 to 18,446,744,073,709,551,615 | 64 bits |
Enumerated types are defined with the enum keyword.
There are floating-point types of varying precision, and therefore of varying bit length; in ascending order:
Floating-point types, in ascending order | ||
---|---|---|
Type | Precision | Magnitude |
float | ≥ 6 decimal digits | about 10^-37 to 10^+37 |
double | ≥ 10 decimal digits | about 10^-37 to 10^+37 |
long double (C89) | ≥ 10 decimal digits | about 10^-37 to 10^+37 |
C99 added float complex, double complex and long double complex, representing the associated complex numbers.
Derived types:
- struct, union, * for pointers;
- […] for arrays;
- (…) for functions.
The _Bool type was standardized by C99. In earlier versions of the language, it was common to define a synonym:
typedef enum boolean {false, true} bool;
The void type represents the absence of a value, as in an empty function parameter list or a function that returns nothing.
The void* type is the generic pointer: any data pointer can be implicitly converted to and from void*. For example, it is the type returned by the standard malloc function, which allocates memory. This type does not lend itself to operations that require knowing the size of the pointed-to type (pointer arithmetic, dereferencing).
Structures
C supports compound types with the notion of structure. To define a structure, use the struct keyword followed by the name of the structure. Members must then be declared in braces. Like any declaration, it ends with a semicolon.
/* Declaration of the Person structure */
struct Person {
    int age;
    char *name;
};
To access the members of a structure, use the . operator.
int main()
{
    struct Person p;
    p.name = "Albert";
    p.age = 46;
}
Functions can receive pointers to structures. They work with the same syntax as regular pointers. However, the -> operator must be used on the pointer to access the fields in the structure. It is also possible to dereference the pointer first and then use the . operator as usual.
#include <stdio.h>

void birthday(struct Person *p)
{
    p->age++;
    printf("Happy birthday %s !", (*p).name);
}

int main()
{
    struct Person p;
    p.name = "Albert";
    p.age = 46;
    birthday(&p);
}
Comment
In versions of C prior to C99, comments had to start with a forward slash and an asterisk (“/*”) and end with an asterisk and a forward slash. Almost all modern languages have used this syntax to write comments in code. Everything between these symbols is commentary, including line breaks:
/* This is a comment
on two lines
or more */
The C99 standard has taken from C++ the end-of-line comments, introduced by two forward slashes and ending with the line:
// Comment to the end of the line
Control structures
The syntax of the various existing control structures in C is widely used in several other languages, such as C++ of course, but also Java, C#, PHP or JavaScript.
The three main types of structures are present:
- tests (also called conditional branching) with:
  - if (expression) statement
  - if (expression) statement else statement
  - switch (expression) statement, with case and default in the statement
- loops with:
  - while (expression) statement
  - for (expression_optional ; expression_optional ; expression_optional) statement
  - do statement while (expression)
- jumps (unconditional branching):
  - break
  - continue
  - return expression_optional
  - goto label
Functions
Functions in C are blocks of statements that receive zero or more arguments and can return a value. If a function returns no value, the void keyword is used as its return type. A function can also receive no arguments; using the void keyword as the parameter list is recommended in this case.
#include <stdio.h>

// Function returning no value (called procedure)
void display(int a)
{
    printf("%d", a);
}

// Function returning an integer
int sum(int a, int b)
{
    return a + b;
}

// Function without any arguments
int enter(void)
{
    int a;
    scanf("%d", &a);
    return a;
}
Prototype
A prototype consists of declaring a function and its parameters without the instructions that compose it. A prototype ends with a semicolon.
// Enter prototype
int enter(void);

// Function using enter
int sum(void)
{
    int a = enter(), b = enter();
    return a + b;
}

// Enter definition
int enter(void)
{
    int a;
    scanf("%d", &a);
    return a;
}
Typically, all prototypes are written to .h files, and functions are defined in a .c file.
Ambiguous behaviors
The C language standard deliberately leaves some operations without precise specifications. This property of C allows compilers to directly use processor-specific instructions, perform optimizations, or skip certain operations, to compile short and efficient executable programs. On the other hand, it is sometimes the cause of portability bugs of source codes written in C.
There are three categories of such behaviors:
- implementation-defined: the behavior is not specified in the standard but depends on the implementation. The choice made by an implementation must be documented by that implementation. A program using this type of behavior is correct, although not guaranteed to be portable.
- unspecified: the choice is not specified in the standard, and this time it does not have to be documented. In particular, it does not have to be identical each time for the same implementation. A program using this type of behavior is also correct.
- undefined: as the name suggests, the operation is not defined. The standard imposes no limitation on what the compiler can do in this case. Anything can happen. The program is incorrect.
Implementation-defined behaviors
In C, the behaviors defined by the implementation are those where the implementation must choose a behavior and stick to it. This choice can be free or from a list of possibilities given by the standard. The choice must be documented by the implementation, so that the programmer can know and use it.
One of the most important examples of such behavior is the size of integer data types. Standard C specifies the minimum size of the basic types, but not their exact size. Thus the type int, for example, which corresponds to the machine word, must have a minimum size of 16 bits. It can have a size of 16 bits on a 16-bit processor and a size of 64 bits on a 64-bit processor.
Another example is the representation of signed integers. It can be two's complement, ones' complement, or a sign-and-magnitude system with a sign bit and value bits. The vast majority of modern systems use two's complement, which is, for example, the only representation still supported by GCC. Older systems use other formats: the IBM 7090 uses sign and magnitude, while the PDP-1 and the UNIVAC and its descendants, some of which are still in use today such as the UNISYS 2200 series, use ones' complement.
Another example is the right shift of a negative signed integer. Typically, the implementation can choose either to shift as for an unsigned integer or to propagate the most significant bit, which represents the sign.
Unspecified behaviors
Unspecified behaviors are similar to implementation-defined behaviors, but the behavior adopted by the implementation does not need to be documented. It does not even have to be the same in all circumstances. Nevertheless, the program remains correct; the programmer simply cannot rely on a particular rule.
For example, the order in which parameters are evaluated during a function call is not specified. The compiler can even choose to evaluate in a different order the parameters of two calls to the same function, if it can help its optimization.
Undefined behaviors
The C standard defines certain cases where syntactically valid constructs have undefined behavior. According to the standard, anything can then happen: the compilation can fail, or produce an executable that will be interrupted, or that will produce false results, or even that will give the appearance of running without error. When a program contains undefined behavior, it is the behavior of the entire program that becomes undefined, not just the behavior of the statement containing the error. Thus, an erroneous instruction can corrupt data that will be processed much later, postponing the manifestation of the error. And even without being executed, an erroneous instruction can cause the compiler to perform optimizations based on false assumptions, producing an executable that does not do what is intended at all.
Examples
We can point out the classic division by zero, or the multiple assignment of a variable in the same expression with the example:
int i = 4;
i = i++; /* Undefined behavior. */
One might think that in this example i could be worth 4 or 5 depending on the choice of compiler, but it could just as easily be 42, or the assignment could stop the execution, or the compiler could refuse to compile. No guarantee exists as soon as undefined behavior is present.
To cite just a few examples, the dereferencing of a null pointer, any access to an array outside its limits, the use of an uninitialized variable and the overflow of signed integers all have undefined behavior. The compiler can use the fact that a construct is undefined in some cases to assume that this case never occurs and optimize the code more aggressively. While the above example may seem obvious, some complex examples can be much more subtle and sometimes cause serious bugs.
For example, a lot of code contains checks to avoid execution in out-of-bounds cases, which might look like this:
char buffer[BUFLEN];
char *buffer_end = buffer + BUFLEN;
unsigned int len;

/* ... */

if (buffer + len >= buffer_end || /* Buffer overflow check */
    buffer + len < buffer)        /* Overflow check if very wide len */
    return;

/* If no overflow, perform the planned operations */
/* ... */
On the surface, this code is cautious and performs the necessary security checks so as not to overflow the allocated buffer. In practice, recent versions of compilers such as GCC, Clang or Microsoft Visual C++ can remove the second test, and make overflows possible. Indeed, the standard specifies that pointer arithmetic on an object cannot yield a pointer outside that object. The compiler can therefore decide that the test is always false and delete it. The correct check is as follows:

    char buffer[BUFLEN];
    unsigned int len;

    /* ... */

    if (len >= BUFLEN) /* Buffer overflow check */
        return;

    /* If no overflow, perform the planned operations */
    /* ... */

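To make the corrected check concrete, here is a minimal sketch that wraps it in a function. The name copy_checked, the value chosen for BUFLEN and the return convention are assumptions made for this illustration; they are not part of the original example. The point is only that the length is compared as an integer, so no out-of-range pointer is ever computed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUFLEN 16 /* Assumed size, chosen only for this illustration. */

/*
 * copy_checked is a hypothetical helper, not part of the original example.
 * It copies len bytes from src into dst (an array of BUFLEN chars) and
 * appends a terminating '\0'.
 * Returns 0 on success, or -1 if len bytes plus the terminator do not fit.
 */
int copy_checked(char *dst, const char *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    /* Compare sizes as integers; no out-of-range pointer is ever formed. */
    if (len >= BUFLEN)
        return -1;

    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}
```

Because the test involves only integers, the compiler has no undefined behavior to reason from and cannot legitimately remove it.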
In 2008, when GCC developers modified the compiler to optimize away certain overflow checks that relied on undefined behavior, CERT issued a warning about using recent versions of GCC. Since these optimizations are in fact present in most modern compilers, CERT later revised its warning accordingly.
Tools exist to detect these problematic constructs, and good compilers can flag some of them (sometimes only when particular options are enabled), but none claims to be exhaustive.
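Signed integer overflow, one of the undefined behaviors listed above, can be avoided by testing the operands before performing the operation rather than inspecting the result afterwards. The following is a minimal sketch of that idea; checked_add is a hypothetical helper name introduced here, not a standard library function.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * checked_add is a hypothetical helper, not a standard function.
 * It stores a + b in *out and returns true when the sum fits in an int;
 * otherwise it returns false and leaves *out untouched.
 */
bool checked_add(int a, int b, int *out)
{
    assert(out != NULL);

    /* Both tests are evaluated before any addition, so no overflow occurs. */
    if (b > 0 && a > INT_MAX - b)
        return false; /* a + b would exceed INT_MAX */
    if (b < 0 && a < INT_MIN - b)
        return false; /* a + b would fall below INT_MIN */

    *out = a + b;
    return true;
}
```

A test written after the fact, such as checking whether a + b is smaller than a, relies on the overflow having already happened and may be removed by the optimizer. As an example of the detection options mentioned above, GCC and Clang both provide -fsanitize=undefined, which reports such overflows at run time.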
Software libraries in C
The standard library
The standard library, available with every hosted implementation, has the simplicity one would expect of a low-level language. Here is a list of some of the headers that declare its types and functions:
- <assert.h>: run-time diagnostics (assert);
- <ctype.h>: testing and classification of characters (isalnum, tolower);
- <errno.h>: minimal error handling (declaration of errno);
- <math.h>: basic mathematical functions (sqrt, cos); many additions in C99;
- <signal.h>: signal management (signal and raise);
- <stddef.h>: general definitions (definition of the NULL macro);
- <stdio.h>: basic input/output (printf, scanf);
- <stdlib.h>: general functions (malloc, rand);
- <string.h>: manipulation of character strings (strcmp, strlen);
- <time.h>: time manipulation (time, ctime).
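As an illustration of how several of these headers are typically combined, here is a small sketch. The function to_lower_in_place is a hypothetical name chosen for this example; only assert, strlen and tolower come from the standard library.

```c
#include <assert.h> /* assert */
#include <ctype.h>  /* tolower */
#include <stddef.h> /* size_t, NULL */
#include <string.h> /* strlen */

/*
 * to_lower_in_place is a hypothetical helper written for this example.
 * It converts every character of the string s to lower case, in place,
 * and returns the length of the string.
 */
size_t to_lower_in_place(char *s)
{
    assert(s != NULL);

    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++) {
        /* tolower expects a value representable as unsigned char (or EOF). */
        s[i] = (char)tolower((unsigned char)s[i]);
    }
    return len;
}
```

The cast to unsigned char matters: passing a negative char value to the <ctype.h> functions is itself undefined behavior.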
The standard library offers no support for graphical interfaces, networking, serial or parallel port I/O, real-time systems, processes, or advanced error handling (such as structured exceptions). This would further restrict the practical portability of programs that need some of these features, were it not for the many portable libraries that compensate for this lack; in the UNIX world, this need also led to the emergence of another standard, POSIX.1.
External libraries in C
As C is one of the most widely used programming languages, many libraries have been created for use with it, such as GLib. Frequently, when a data format is invented, a C library or reference program is written to manipulate it. This is the case for zlib, libjpeg, libpng, Expat, the MPEG reference decoders, libsocket, etc.
C examples
Here are a few examples that briefly illustrate some properties of C. For more information, see the WikiBook Programming C.
Memory allocation
The int_list structure represents an item in a linked list of integers. The following two functions (insert_next and remove_next) are used to add and remove an item from the list.

    /* Memory management is not built into the language;
       it is provided by standard library functions. */
    #include <stdlib.h>

    struct int_list {
        struct int_list *next; /* Pointer to next element */
        int value;             /* Item value */
    };

    /*
     * Add one item after another.
     * node  : element after which to add the new one
     * value : value of the element to add
     * Return: address of added element, or NULL in case of error.
     */
    struct int_list *insert_next(struct int_list *node, int value)
    {
        /* Memory allocation for a new element. */
        struct int_list *const new_next = malloc(sizeof *new_next);

        /* If the allocation was successful, insert new_next
           between node and node->next. */
        if (new_next) {
            new_next->next = node->next;
            node->next = new_next;
            new_next->value = value;
        }
        return new_next;
    }

    /*
     * Delete the item that follows another.
     * node : element whose successor is deleted
     * Warning: undefined behavior if there is no following element!
     */
    void remove_next(struct int_list *node)
    {
        struct int_list *const node_to_remove = node->next;

        /* Remove the next item from the list. */
        node->next = node->next->next;
        /* Free the memory occupied by the removed element. */
        free(node_to_remove);
    }

In this example, the two essential functions are malloc and free. The first allocates memory: its parameter is the number of bytes to allocate, and it returns the address of the first allocated byte, or NULL if the allocation fails. free releases memory that was previously allocated by malloc.
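The same malloc and free pattern applies to arrays. The sketch below is a minimal illustration under stated assumptions: make_squares is a hypothetical helper name, and n is assumed small enough that the squares fit in an int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * make_squares is a hypothetical helper written for this example.
 * It returns a newly allocated array holding 0*0, 1*1, ..., (n-1)*(n-1),
 * or NULL if the allocation fails. The caller must release it with free.
 * Assumption: n is small enough for every square to fit in an int.
 */
int *make_squares(size_t n)
{
    assert(n > 0);

    /* Refuse sizes whose byte count would not fit in a size_t. */
    if (n > SIZE_MAX / sizeof(int))
        return NULL;

    int *squares = malloc(n * sizeof *squares);
    if (squares == NULL)
        return NULL; /* Allocation failed: nothing to free. */

    for (size_t i = 0; i < n; i++)
        squares[i] = (int)(i * i);

    return squares;
}
```

A caller would test the returned pointer against NULL before using the array, then pass that same pointer to free exactly once when the array is no longer needed.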
Some notable programs written in C
- UNIX
- GNU Compiler Collection (GCC)
- Linux kernel
- Microsoft Windows Kernel
- GNOME