This relatively short document tries to be a quick introduction to the C language for students who already know other languages but need to get their hands dirty with operating systems. The driving force for writing it is the "real-time systems" course that I've been appointed to teach in Pavia in 2006.
The language in itself is pretty small, so this document covers almost all of it, though briefly. Thus, it isn't "From A to Z", but it isn't "From A to C" either, as I initially planned.
To keep the text compact and avoid needless replication of other structured documentation, I won't be ashamed of relying on Unix and GNU/Linux material in my examples, or on the gcc compiler when discussing actual compilation. Similarly, I'll use half-truths when they help simplify the discussion.
The C language lives close to the hardware. It was designed as a replacement for assembly in order to increase program portability. Translation to machine language is pretty straightforward and optimization techniques are well developed. Because of this, it is still the most used language for writing operating system kernels and other low-level code.
All "objects" managed by the language are simple ones. In fact, they are all integers, preferably the same size as machine registers. There is no formalized concept of "object", "class", "instance" and other beautiful things that are fashionable nowadays. Nonetheless, writing code in an object-oriented way is possible, and highly recommended.
There is no such thing in the language as a "boolean" type. Any zero value is considered false and any non-zero value is evaluated as true. Any definition of boolean within the language is an artificial one; I think it should be avoided as a needless and dangerous practice.
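A minimal sketch of the zero/non-zero rule follows. The helper name truth_of is made up for this example and is not part of any library; it only shows that the condition of an "if" is an ordinary integer expression.

```c
/* Hypothetical helper: reports how C treats an integer used as a condition. */
int truth_of(int value)
{
    if (value)      /* same as: if (value != 0) */
        return 1;
    return 0;
}
```

With this definition, truth_of(0) gives 0, while truth_of(5) and truth_of(-3) both give 1.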
C is a procedural language: a program is expressed as a set of procedures, here called functions. Each function is global to the application; it receives a number of arguments and returns a single value, or nothing at all.
Variables can be global or local to a specific function. They can be simple or composite; simple types are integers and composite ones are data structures. A pointer is the address of other data (or code) in computer memory. The "null pointer" is actually zero, which is never a valid pointer.
Identifiers (i.e., the names of functions and variables) are made up of letters, digits and the underscore character, the first character not being a digit. Identifiers are case-sensitive; I personally discourage using long names and uppercase letters: names like SortArrayOfNames considerably slow down both writing and reading programs -- writing on the keyboard and reading aloud. I'd rather call the function sort_names.
The compiler reads source code only once, so in some cases you need to declare a variable or a function, in addition to defining it. For example, when a function calls another function defined later in the same source file, you need to declare such a function in advance. Declarations for library functions are collected in files called headers, that are included at the beginning of every C source file (at their "head").
The preprocessor is a program that makes typographical modifications on the source file, before the file is seen by the compiler proper. The preprocessor is part of the compiler and is mandated by the language specification, so every C source gets preprocessed.
Each source line that begins with '#' ("hash") is a directive for the preprocessor. Such directives allow you to physically include other files, redefine the meaning of identifiers (by purely typographical substitution within the source file), and conditionally disable part of the source code at compilation time (again, by discarding the text before the compiler can see it). As is apparent, it is a very powerful tool but also a dangerous one; for example, the compiler is unable to check for syntax errors in the disabled parts.
Execution of a program starts from the main function, which receives a few arguments that can be ignored at this point. It returns an integer number. If main returns 0, it means that the program completed successfully; if it returns non-zero, it means an error happened. The exact number may denote the kind of error that happened, if whoever runs the program has the information to tell them apart.
In the case of "freestanding" programs, those that don't run under an operating system, the main function has no special role and may be missing altogether. Examples of such programs are kernels, boot loaders and microcontroller firmware images.
Newlines, spaces and tabs have the same role. The indenting style is thus free, and different programmers use different styles. It's important, nonetheless, to avoid abusing such freedom, and to write code which is ordered and easily readable, indenting logic blocks appropriately.
An instruction can be a semicolon-terminated expression, a control primitive or a brace-enclosed block. The concept of expression includes everything, including variable assignment, with the only exception of control primitives.
Control primitives are the following ones, where "expr" and "istr" (an expression and an instruction) denote syntax elements, and the square brackets mark optional elements:
    if ( expr ) istr [ else istr ]
    while ( expr ) istr
    for ( expr ; expr ; expr ) istr
    do istr while ( expr ) ;
    switch ( integer-expr ) { case: .... }
    break ;
    continue ;
    return [ expr ] ;
The switch construct is a special case and requires a section of its own, so we'll ignore it for now.
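The primitives listed above (switch excluded) can be sketched in a few small functions. All the function names below are made up for illustration.

```c
/* Hypothetical examples of the control primitives (switch excluded). */

/* for: sum of the integers from 1 to n */
int sum_to(int n)
{
    int i, total = 0;

    for (i = 1; i <= n; i++)
        total += i;
    return total;
}

/* do/while: number of decimal digits in a non-negative integer */
int count_digits(int n)
{
    int digits = 0;

    do {
        digits++;
        n /= 10;
    } while (n);
    return digits;
}

/* while, if, continue and break: first odd multiple of 3 that is >= start */
int next_odd_multiple_of_3(int start)
{
    int n = start - 1;

    while (1) {
        n++;
        if (n % 2 == 0)
            continue;    /* even: go straight to the next iteration */
        if (n % 3 == 0)
            break;       /* found: leave the loop */
    }
    return n;
}
```

Note that the body of do/while always runs at least once, which is why count_digits(0) returns 1 and not 0.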
To define a function, you need to write the return type, followed by the function name and the list of arguments, each preceded by its own type. After that, the code is enclosed in braces. A function declaration, or "prototype", is like the definition but instead of the code you write a terminating semicolon.
A variable is defined by writing its type, its name and a terminating semicolon. If declared outside of functions, it is global; if declared inside a block (braces), it is local to that block.
Example: a prototype, a function, a global variable, the function that we declared earlier:
    int sum(int a, int b);

    int average(int x, int y)
    {
        int s;

        s = sum(x, y);
        return s/2;
    }

    int globalv;

    int sum(int a, int b)
    {
        return a + b;
    }
The preprocessor is mainly used to include other files and define symbolic names for numeric constants. If the included file is named within angle brackets, it's looked for among system headers; if it is named within double quotes, it is looked for in the current directory first. Example:
    #include <stdio.h>
    #include "application.h"

    #define ERR_NOERROR     0
    #define ERR_INVALID     1
    #define ERR_NODATA      2
    #define ERR_PERMISSION  3
According to a convention that everybody follows, the constants defined by the preprocessor are uppercase, like in the example above. This allows them to be immediately identified while reading the program text, and avoids confusing them with variables. If your fellow programmers wrote their code properly, upper case means constant.
Comments are delimited by /* and */, or start from // and extend to end-of-line. The second form comes from C++ and most C programmers don't like it. It's always good practice to comment your program extensively, by writing how and why it does something, not what it does, since what it does is already apparent in the code itself. This rule applies to all languages, but it's so important that I'd better repeat it.
Simple data are integer numbers, or floating point values (that we ignore as they are not used in system programming), or pointers. Integer types are predefined in the language and are the following ones, but note that signed is usually omitted, as it applies by default (with the exception of plain char, whose signedness depends on the platform).
    char     signed char     unsigned char
    short    signed short    unsigned short
    int      signed int      unsigned int
    long     signed long     unsigned long
You can't make assumptions about the byte size of such types, but in practice a char is always 8 bits (the standard only guarantees at least 8). The int type is usually 32 bits long, unless the host processor is an 8-bit or 16-bit device (example: Arduino), but as suggested you can't make assumptions. Your programs must not depend on a specific size of the base data types.
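As a sketch of how to query sizes instead of assuming them: the sizeof operator returns the size of a type in units of char, and CHAR_BIT from <limits.h> tells how many bits a char holds. The function name int_width_bits is made up for this example.

```c
#include <limits.h>   /* CHAR_BIT: number of bits in a char */

/* Hypothetical helper: width of an int, in bits, on the current platform. */
int int_width_bits(void)
{
    return (int)(sizeof(int) * CHAR_BIT);
}
```

The result is typically 32 on a PC, but it can be 16 on a small microcontroller, which is exactly why the code should ask rather than assume.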
All pointers are the same size, and they are either 32 or 64 bits wide, according to the processor you work with. On all platforms, unsigned long and pointers have the same size (Windows got it wrong, because their long is not long enough: let's ignore it).
A pointer is defined by writing the type it points to, the asterisk and the name of the pointer variable. For example "int *p;". You should think about the asterisk as "pointed to by"; so the previous example is read as "it's integer [the value] pointed to by p".
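A minimal sketch of a pointer in use follows; the function name pointer_demo is made up for this example. The "&" operator takes the address of a variable, and the asterisk reaches the value pointed to.

```c
/* Hypothetical example: reading and writing an integer through a pointer. */
int pointer_demo(void)
{
    int x = 5;
    int *p;        /* "it's integer, the value pointed to by p" */

    p = &x;        /* p now holds the address of x */
    *p = *p + 2;   /* reads 5 and writes 7 into x, through the pointer */
    return x;
}
```

The function returns 7, even though x is never assigned directly after its initialization.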
The Linux kernel defines, within <linux/types.h>, the following sized types, both unsigned and signed:
    u8     s8
    u16    s16
    u32    s32
    u64    s64
The C99 standard defines the following sized types. The last type listed below is an integer that has the same size as a pointer, so in practice it is as wide as long in all sane environments (intptr_t is signed; its unsigned counterpart is uintptr_t):
    uint8_t     int8_t
    uint16_t    int16_t
    uint32_t    int32_t
    uint64_t    int64_t
    intptr_t
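The C99 types live in the <stdint.h> header. The sketch below, with a made-up function name, shows the typical use of intptr_t: holding a pointer in an integer variable and converting it back.

```c
#include <stdint.h>

/* Hypothetical helper: store a pointer in an intptr_t and get it back. */
void *roundtrip(void *p)
{
    intptr_t as_integer = (intptr_t)p;   /* pointer converted to an integer */

    return (void *)as_integer;           /* and back to the same pointer */
}
```

The sized types are the right choice when the size matters (hardware registers, file formats, network packets); elsewhere plain int and long are fine.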
A data structure is a composite data type, whose components are called fields and can be either simple types or other data structures. A structure is declared in the following way:
    struct name {
        field-type field-name ;
        [field-type field-name ; ... ]
    } ;
After declaring it, "struct name" is the name of a new type, that can be used to declare variables or pointers. Example:
    int count;
    struct stat stbuf;
    struct stat *stptr;
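As a small sketch of how such variables and pointers are then used: fields are reached with "." from a structure variable and with "->" from a pointer to a structure. The structure and function below are made up for this example.

```c
/* Hypothetical structure, declared only for this example. */
struct point {
    int x;
    int y;
};

/* Fields are reached with "." from a variable and "->" from a pointer. */
int point_demo(void)
{
    struct point pt;
    struct point *ptr = &pt;

    pt.x = 3;          /* access through the variable */
    ptr->y = 4;        /* access through the pointer: same as (*ptr).y */
    return pt.x + pt.y;
}
```

Both assignments act on the same structure, so the function returns 7.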
Structures can be initialized in three different ways. You can list the fields using the comma as separator (traditional syntax); you can assign field names with a colon (a gcc extension that predates standardization); you can use proper field assignment (standardized by C99, supported by gcc as well). The first form should be avoided as it's not easily readable; the second is discouraged as non-standard. In all three cases, every field which is not initialized explicitly is zeroed bit-by-bit by the compiler. In the following example, the three structures are identical, with the priv field initialized to zero:
    struct item {int id; char *name; int value; int priv;};

    struct item i1 = {3, "robert", 45};
    struct item i2 = {id: 3, name: "robert", value: 45};
    struct item i3 = {.id = 3, .name = "robert", .value = 45};
Every function returns one value, either simple or composite, or void, i.e. nothing. A function receives zero or more arguments.
Arguments are simple or composite types, and they are always passed by value. Arguments can be modified within the function like they were local variables. Passing by reference is not supported.
Even though the language allows it, data structures are not usually passed as arguments or used as return types. The preferred practice, for efficiency reasons, is allocating structures and passing only pointers to them as arguments. This is a way to pass arguments by reference.
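The difference between the two cases can be sketched as follows. The structure and the three functions are made up for this example.

```c
/* Hypothetical structure and functions, made up for this example. */
struct counter {
    int value;
};

/* The argument is a copy: the caller's variable is not affected. */
void bump_copy(int n)
{
    n = n + 1;
}

/* The argument is a pointer: the function changes the caller's structure. */
void bump_counter(struct counter *c)
{
    c->value = c->value + 1;
}

/* Returns 1 when both expectations stated in the comments hold. */
int by_value_demo(void)
{
    int n = 10;
    struct counter c = { .value = 10 };

    bump_copy(n);        /* n is still 10 */
    bump_counter(&c);    /* c.value is now 11 */
    return n == 10 && c.value == 11;
}
```

The pointer itself is still passed by value: bump_counter receives a copy of the address, which is enough to reach the original structure.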
If a function needs to return more than one value (for example, an integer number and an error code), you can pass a pointer argument, so the function can write the second return value to a variable of the calling function. Example:
    int findid(struct obj *item, int *errorptr)
    {
        if (isvalid(item) == 0) {
            *errorptr = ERR_INVALID;
            return 0;
        }
        *errorptr = ERR_NOERROR;
        return internal_findid(item);
    }
You can define functions with a variable number of arguments, called "variadic functions". The most common example is printf and its variants. Defining variadic functions is not trivial, and is not discussed here.
Calling a variadic function is pretty common, and you just need to correctly pass all arguments. In the case of the printf family of functions, one of the first arguments is a string that specifies how many arguments are needed and the type of each of them. The variadic function uses the string to know what arguments to expect and fetch them from memory. Since the string format is standardized, the compiler can check all arguments and warn about possible errors if the argument list is inconsistent with the string. For variadic functions that can't be related to printf or scanf or a few other known patterns, the compiler can't check arguments.
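A short sketch of a call to a function of the printf family follows; snprintf writes to a buffer instead of the terminal. The helper name format_item and the "%d:%s" layout are made up for this example.

```c
#include <stdio.h>

/*
 * Hypothetical helper: the format string announces one int ("%d") and
 * one string ("%s"), so exactly two more arguments must follow it.
 */
int format_item(char *buf, size_t len, int id, const char *name)
{
    return snprintf(buf, len, "%d:%s", id, name);
}
```

Calling format_item(buf, sizeof(buf), 3, "robert") on a large enough buffer leaves "3:robert" in it and returns 8, the number of characters written. If the two trailing arguments were swapped, gcc would warn about the mismatch when invoked with -Wall.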
Polymorphic functions don't exist in C: every function name can exist only once in each program, and every function call must pass the same number and type of arguments -- with the exception of variadic functions, where you can pass an arbitrary number of arguments after the initial ones, including no additional argument at all.
As already noted, the preprocessor is a program that filters source files before the compiler proper can see them.
By including headers you can access function prototypes, data structure declarations and global variables that are defined elsewhere. Normally the documentation of a library function also states which header you need to include to pass relevant information to the compiler.
With #define you can define constants, like in the earlier example, or macros that receive arguments. In any case the substitution is merely typographic, which makes it easy to run into errors like the following:
    #define square(a) a*a
With this definition, "square(1+2)" gets expanded as "1+2*1+2" and evaluates to 5. Moreover, when a macro argument is expanded more than once, the macro can't behave like a function because operators like "++" get repeated in the program text, with undesired effects.
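The usual remedy for the first problem is to wrap each macro argument, and the whole macro body, in parentheses. The sketch below keeps both versions side by side; all the names are made up for this example.

```c
/* The unprotected definition, renamed to keep both versions around. */
#define square_bad(a)   a*a

/* Parentheses around each argument and around the whole body. */
#define square_ok(a)    ((a)*(a))

/* Hypothetical helpers that show the two expansions. */
int bad_result(void)
{
    return square_bad(1+2);   /* expands to 1+2*1+2 */
}

int ok_result(void)
{
    return square_ok(1+2);    /* expands to ((1+2)*(1+2)) */
}
```

Here bad_result returns 5 and ok_result returns 9. Parentheses don't help with the second problem, though: square_ok(i++) still contains "i++" twice.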
The "#ifdef X" - "#else" - "#endif" form only checks whether the symbol X is defined (by means of #define) or not. The "#if expr" - "#else" - "#endif" form evaluates a constant integer expression, so the value must be known at compilation time. In #if you can refer to numbers, symbols defined earlier and integer operators, but you can also use "defined(X)". To avoid too many conditional levels and too many #endif lines you can use #elif, which means "else if".
That said, please avoid preprocessor conditionals if possible.
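For reference when reading other people's code, here is a sketch of the conditional forms at work. All the symbols (USE_LOGGING, BUFFER_KB and so on) are made up for this example.

```c
/* Hypothetical configuration symbols, made up for this example. */
#define BUFFER_KB 4
#define USE_LOGGING

#ifdef USE_LOGGING
#define LOG_ENABLED 1          /* selected: USE_LOGGING is defined */
#else
#define LOG_ENABLED 0
#endif

#if BUFFER_KB >= 64
#define BUFFER_CLASS 2         /* large */
#elif defined(USE_LOGGING) && BUFFER_KB >= 4
#define BUFFER_CLASS 1         /* medium: this branch is selected */
#else
#define BUFFER_CLASS 0         /* small */
#endif

int log_enabled(void)
{
    return LOG_ENABLED;
}

int buffer_class(void)
{
    return BUFFER_CLASS;
}
```

Only the selected branches reach the compiler proper; the text of the other branches is discarded, so mistakes in them go unnoticed until the configuration changes.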
Libraries contain code and data, i.e. global functions and global variables. Many functions used by C programs have standardized names, and the filenames for headers you include are standardized as well. So we include <stdio.h> to work with files, we include <string.h> if we need to call string functions (to compare, find the length, extract substrings and so on), and there are many more standard headers.
We are not interested, within the topic of real-time systems, in getting acquainted with the many functions you find in the standard library. We just need to know that all of these global functions and variables (like stdin and stdout) are part of the «C library», that the compiler automatically uses when it needs to resolve undefined symbols in the source file. The compiler might also refer to another library (for example libgcc), that contains procedures called by the compiler within the generated object code; this library is automatically included like the standard one, during the final compilation steps.
If you need additional libraries, like libjpeg, your source must include the relevant headers. Please note that such headers are not the libraries. The structure struct jpeg_compress_struct is declared in <jpeglib.h>, but the function jpeg_start_compress is made up of machine code that lives in another file, the library, that is used by the linker, not by the preprocessor.
The compiler takes your C code, asks the preprocessor to make typographic modifications to it and then translates the result to assembly code. It then passes the output to the assembler, which converts it to object files that contain machine code.
Such object files include code and data, together with lists of undefined symbols. A symbol, at this level, is just a name that needs to be matched to a memory address. The final resolution of undefined symbols is performed by a program called "linker", whose executable file is called «ld» ("LoaDer").
Therefore, some compilation errors are reported by the linker and not by the compiler proper: typically this happens with "undefined symbol" errors when the C source contains a mistyped function name. Depending on how much symbolic information is part of the object file, the error message can state the exact line number in the source file or have no reference to the source code at all.
Usually you let the preprocessor and assembler work with default settings, but sometimes you may need to directly control the actions of the linker, for example to specify which libraries it must look up to resolve undefined symbols. This is not needed for simple programs.
When you build programs that are not hosted within an operating system (uC firmware, kernel, boot-loader), the linker must be instructed to not use the standard library during symbol resolution.
What is shown here summarizes the most important features of the C language and should be enough to be able to read well-written code and not feel completely lost.
There are, nonetheless, some additional points that I feel are important, so I split them off into another document: A-C-X-more-en.html.
As a book, if you want to get one, I suggest the Kernighan and Ritchie, which is a very good text. The others are usually horrible, and I don't know anything of intermediate quality.
«C for Java Programmers», http://www.cs.cornell.edu/courses/cs414/2001SP/tutorials/cforjava.htm, can be useful, even if it doesn't cover the parts that are most important for my course, and goes into detail on topics that I don't find interesting.
Typical errors for the Java programmer when moving to C: http://www.dcs.ed.ac.uk/home/iok/cforjavaprogrammers.phtml.
Wikipedia: http://en.wikipedia.org/wiki/C_%28programming_language%29 and http://it.wikipedia.org/wiki/C_%28linguaggio%29.
Home page of Dennis Ritchie, with historical references to C and Unix: https://www.bell-labs.com/usr/dmr/www/.
Authors admit C language is a hoax: http://www.gnu.org/fun/jokes/unix-hoax.html