From A to X passing through C

This relatively short document is meant as a quick introduction to the C language for students who already know other languages but need to get their hands dirty with operating systems. The driving force for writing it is the "real-time systems" course that I was appointed to teach in Pavia in 2006.

The language itself is pretty small, so this document covers almost all of it, though briefly. Thus, it isn't "From A to Z", but it isn't "From A to C" either, as I initially planned.

To keep the text compact and avoid needlessly replicating other structured documentation, I won't be ashamed of relying on Unix and GNU/Linux material in my examples, or on the gcc compiler for actual compilation. Similarly, I'll use half-truths when they help simplify the discussion.

Basic concepts

The C language lives close to the hardware. It was designed as a replacement for assembly in order to increase program portability. Translation to machine language is pretty straightforward and optimization techniques are well developed. Because of this, it is still the most used language for writing operating system kernels and other low-level code.

All "objects" managed by the language are simple ones. In fact, they are all integers, preferably the same size as machine registers. There is no formalized concept of "object", "class", "instance" and other beautiful things that are fashionable nowadays. Nonetheless, writing code in an object-oriented way is possible, and highly recommended.

There is no such thing in the original language as a "boolean" type (C99 added _Bool and <stdbool.h>, but that is still an integer underneath). Any zero value is considered false and any non-zero value is evaluated as true. Any definition of boolean within the language is an artificial one; I think it should be avoided as a needless and dangerous practice.
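A minimal sketch of the zero/non-zero rule follows. The two function names are made up for illustration; they are not part of any library.

```c
/* Hypothetical helper: returns 1 if any bit of mask is set in flags. */
int is_set(unsigned int flags, unsigned int mask)
{
    /* The result of & is an integer; "if" only compares it against zero. */
    if (flags & mask)
        return 1;
    return 0;
}

/* Hypothetical helper: counts how many of the three values are "true". */
int count_true(int a, int b, int c)
{
    /* !!x turns any non-zero value into 1 and leaves 0 as 0. */
    return !!a + !!b + !!c;
}
```

Note that -3 counts as true exactly like 5 does: only zero is false.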

C is a procedural language: a program is expressed as a set of procedures, here called functions. Each function is global to the application; it receives a number of arguments and returns a single value, or nothing at all.

Variables can be global or local to a specific function. They can be simple or composite; simple types are integers and composite ones are data structures. A pointer is the address of other data (or code) in computer memory. The "null pointer" is actually zero, which is never a valid pointer.

Identifiers (i.e., the names of functions and variables) are made up of letters, digits and the underscore character, the first character not being a digit. Identifiers are case-sensitive; I personally discourage using long names and uppercase letters: names like SortArrayOfNames considerably slow down both writing and reading programs -- writing on the keyboard and reading aloud. I'd rather call the function sort_names.

The compiler reads source code only once, so in some cases you need to declare a variable or a function, in addition to defining it. For example, when a function calls another function defined later in the same source file, you need to declare such a function in advance. Declarations for library functions are collected in files called headers, that are included at the beginning of every C source file (at their "head").

The preprocessor is a program that makes typographical modifications on the source file, before the file is seen by the compiler proper. The preprocessor is part of the compiler and is mandated by the language specification, so every C source gets preprocessed.

Each source line that begins with '#' ("hash") is a directive for the preprocessor. Such directives allow you to physically include other files, redefine the meaning of identifiers (by purely typographical substitution within the source file), and conditionally disable part of the source code at compilation time (again, by discarding the text before the compiler can see it). As is apparent, it is a very powerful tool but also a dangerous one; for example, the compiler is unable to check for syntax errors in the disabled parts.

Execution of a program starts from the main function, which receives a few arguments that can be ignored at this point. It returns an integer number. If main returns 0, it means that the program completed successfully; if it returns non-zero, it means an error happened. The exact number may denote the kind of error that happened, if whoever runs the program has the information to tell them apart.

In case of "freestanding" programs, those that don't run under an operating system, the main function has no special role and may be missing altogether. Such programs are for example kernels, boot loaders, microcontroller firmware images.

Quick syntax

Newlines, spaces and tabs have the same role. The indenting style is thus free, and different programmers use different styles. It's important, nonetheless, to avoid abusing such freedom, and to write code that is orderly and easily readable, indenting logic blocks appropriately.

An instruction can be a semicolon-terminated expression, a control primitive or a brace-enclosed block. The concept of expression includes everything, including variable assignment, with the only exception of control primitives.

Control primitives are the following ones, where expr and istr denote syntax elements (an expression and an instruction), and the square brackets mark optional elements:

if ( expr ) istr [ else istr ]
while ( expr ) istr
for ( expr ; expr ; expr ) istr
do istr while ( expr ) ;
switch ( integer-expr ) { case: .... }
break ;
continue ;
return [ expr ] ;

The switch construct is a special case and requires a section of its own, so we'll ignore it for now.
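The primitives listed above (switch aside) can be sketched in three small functions. The function names and the tasks they perform are my own choice for illustration.

```c
/* Sums the integers from 1 to n with a for loop; returns 0 if n < 1. */
int sum_to(int n)
{
    int i, total = 0;

    for (i = 1; i <= n; i++)
        total += i;
    return total;
}

/* Counts decimal digits with do ... while: the body always runs
 * at least once, so the value 0 is reported as one digit. */
int count_digits(unsigned int v)
{
    int digits = 0;

    do {
        digits++;
        v /= 10;
    } while (v != 0);
    return digits;
}

/* Adds the odd numbers below limit, leaving early when total exceeds cap. */
int sum_odd(int limit, int cap)
{
    int i = 0, total = 0;

    while (i < limit) {
        int v = i++;

        if (v % 2 == 0)
            continue;       /* even: skip to the next iteration */
        total += v;
        if (total > cap)
            break;          /* leave the loop immediately */
    }
    return total;
}
```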

To define a function, you need to write the return type, followed by the function name and the list of arguments, each preceded by its own type. After that, the code is enclosed in braces. A function declaration, or "prototype", is like the definition but instead of the code you write a terminating semicolon.

A variable is defined by writing its type, its name and a terminating semicolon. If declared outside of functions, it is global; if declared inside a block (braces), it is local to that block.

Example: a prototype, a function, a global variable, the function that we declared earlier:

int sum(int a, int b);

int average(int x, int y)
{
      int s;
      s =  sum(x, y);
      return s/2;
}

int globalv;

int sum(int a, int b)
{
      return a + b;
}

The preprocessor is mainly used to include other files and define symbolic names for numeric constants. If the included file is named within angle brackets, it's looked for among system headers; if named within double quotes, it is looked for in the current directory first. Example:

#include <stdio.h>
#include "application.h"
#define ERR_NOERROR    0
#define ERR_INVALID    1
#define ERR_NODATA     2
#define ERR_PERMISSION 3

According to a convention that everybody follows, the constants defined by the preprocessor are uppercase, like in the example above. This allows them to be immediately identified while reading the program text, and avoid confusing them with variables. If your fellow programmers wrote their code properly, upper case means constant.

Comments are delimited by /* and */, or start from // and extend to end-of-line. The second form comes from C++ and most C programmers don't like it. It's always good practice to comment your program extensively, by writing how and why it does things, not what it does, since the what is already apparent in the code itself. This rule applies to all languages, but it's so important that I'd better repeat it.

Data types

Simple data are integer numbers, or floating point values (that we ignore as they are not used in system programming), or pointers. Integer types are predefined in the language and are the following ones, but note that signed is usually omitted, as it applies by default (except for plain char, whose signedness depends on the platform).

char      signed char    unsigned char
short     signed short   unsigned short
int       signed int     unsigned int
long      signed long    unsigned long

You can't make assumptions on the byte size of such types, even though in practice a char is always 8 bits. The int type is usually 32 bits long, unless the host processor is an 8-bit or 16-bit device (example: Arduino), but as suggested you can't make assumptions. Your programs must not depend on a specific size of the base data types.
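If you are curious about the sizes on your own machine, a sketch like the following prints them. It relies on two things not introduced in this text: the sizeof operator, which yields the size in bytes, and CHAR_BIT from <limits.h>, the number of bits in a char. The function name is mine.

```c
#include <limits.h>
#include <stdio.h>

/* Prints the sizes found on the machine where the program runs.
 * The output differs across platforms, which is the whole point. */
void print_sizes(void)
{
    printf("bits in a char: %d\n", CHAR_BIT);
    printf("short:  %zu bytes\n", sizeof(short));
    printf("int:    %zu bytes\n", sizeof(int));
    printf("long:   %zu bytes\n", sizeof(long));
    printf("void *: %zu bytes\n", sizeof(void *));
}
```

The only thing the language promises is an ordering: a short is never larger than an int, and an int is never larger than a long.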

All pointers are the same size, and they are either 32 or 64 bits wide, according to the processor you work with. On all platforms, unsigned long and pointers have the same size (Windows got it wrong, because their long is not long enough: let's ignore it).

A pointer is defined by writing the type it points to, the asterisk and the name of the pointer variable. For example, "int *p;". You should think about the asterisk as "pointed to by"; so the previous example is read as "it's integer [the value] pointed to by p".
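As a small sketch of pointers in use, the two functions below (hypothetical names) read and write "the integer pointed to by" their arguments. The & operator, used by the caller to take the address of a variable, is not introduced in the text above.

```c
/* Hypothetical helper: swaps two integers through pointers to them. */
void swap(int *a, int *b)
{
    int tmp = *a;   /* tmp gets the integer pointed to by a */

    *a = *b;
    *b = tmp;
}

/* Hypothetical helper: reads through a pointer, but checks for
 * the null pointer first and returns fallback in that case. */
int value_or(int *p, int fallback)
{
    if (!p)         /* the null pointer is zero, hence "false" */
        return fallback;
    return *p;
}
```

A caller with two int variables x and y would write swap(&x, &y), passing the addresses of the variables rather than their values.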

The Linux kernel defines, within <linux/types.h>, the following sized types, both unsigned and signed:

u8    s8      u16     s16
u32   s32     u64     s64

The C99 standard defines the following sized types. The last type listed below is an integer that has the same size as a pointer, so in practice it's as wide as unsigned long in all sane environments:

uint8_t    int8_t    uint16_t   int16_t
uint32_t   int32_t   uint64_t   int64_t
intptr_t
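The C99 names live in <stdint.h>. Here is a minimal sketch of how they are typically used; both helper names are invented for the example.

```c
#include <stdint.h>

/* Hypothetical helper: builds a 16-bit value from two bytes,
 * high byte first, as you would when parsing a device register. */
uint16_t make_u16(uint8_t hi, uint8_t lo)
{
    return (uint16_t)((hi << 8) | lo);
}

/* Hypothetical helper: checks that a pointer survives a round
 * trip through intptr_t, returning 1 if it does. */
int roundtrip_ok(int *p)
{
    intptr_t n = (intptr_t)p;

    return (int *)n == p;
}
```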

Data Structures

A data structure is a composite data type, whose components are called fields and can be either simple types or other data structures. A structure is declared in the following way:

struct name {
    field-type field-name ;
   [field-type field-name ;  ... ]
} ;

After declaring it, "struct name" is the name of a new type, that can be used to declare variables or pointers. Example:

int count;
struct stat stbuf;
struct stat *stptr;

Structures can be initialized in three different ways. You can list the fields using the comma as separator (traditional syntax); you can assign field names with a colon (a gcc extension that predates standardization); you can use proper field assignment (standardized by C99, supported by gcc as well). The first form should be avoided as it's not easily readable; the second is discouraged as non-standard. In all three cases, every field which is not initialized explicitly is zeroed by the compiler. In the following example, the three structures are identical, with the priv field initialized to zero:

struct item {int id; char *name; int value; int priv;};
struct item i1 = {3, "robert", 45};
struct item i2 = {id: 3, name: "robert", value: 45};
struct item i3 = {.id = 3, .name = "robert", .value = 45};
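To convince yourself that the omitted field really is zero, you can compare the structures field by field. The sketch below repeats the definitions of i1 and i3 (leaving out the non-standard colon form) and adds a comparison helper, same_item, which is my own invention. It uses strcmp from <string.h> to compare the names.

```c
#include <string.h>

struct item {int id; char *name; int value; int priv;};

struct item i1 = {3, "robert", 45};
struct item i3 = {.id = 3, .name = "robert", .value = 45};

/* Hypothetical helper: returns 1 if the two items hold the same data. */
int same_item(const struct item *a, const struct item *b)
{
    return a->id == b->id
        && strcmp(a->name, b->name) == 0
        && a->value == b->value
        && a->priv == b->priv;
}
```

Fields are reached with "->" here because the function receives pointers to the structures; with a structure variable you would write i1.priv instead.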

Functions

Every function returns one value, either simple or composite, or void, i.e. nothing. A function receives zero or more arguments.

Arguments are simple or composite types, and they are always passed by value. Arguments can be modified within the function as if they were local variables. Passing by reference is not supported.

Even though the language allows it, data structures are not usually passed as arguments or used as return types. The preferred practice, for efficiency reasons, is allocating structures and passing only pointers to them as arguments. This is a way to pass arguments by reference.
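The difference between passing a structure and passing a pointer to it can be sketched as follows; struct counter and both functions are hypothetical.

```c
struct counter {
    int hits;
    int misses;
};

/* Receives a copy: the caller's structure is not touched. */
void hit_by_value(struct counter c)
{
    c.hits++;       /* only the local copy changes */
}

/* Receives a pointer: the caller's structure is modified. */
void hit_by_pointer(struct counter *c)
{
    c->hits++;
}
```

After calling hit_by_value(c) the caller still sees the old value of c.hits, while hit_by_pointer(&c) increments it; the second form also avoids copying the whole structure at every call.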

If a function needs to return more than one value (for example, an integer number and an error code), you can pass a pointer argument, so the function can write the second return value to a variable of the calling function. Example:

int findid(struct obj *item, int *errorptr)
{
    if (isvalid(item) == 0) {
        *errorptr = ERR_INVALID;
        return 0;
    }
    *errorptr = ERR_NOERROR;
    return internal_findid(item);
}
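To see the calling side, here is a self-contained sketch built around the same findid function. The text doesn't show struct obj, isvalid or internal_findid, so the versions below are stand-ins I made up just to get something that compiles; lookup is an invented caller too.

```c
#define ERR_NOERROR 0
#define ERR_INVALID 1

/* Hypothetical object: the real fields are not shown in the text. */
struct obj {
    int id;
};

/* Hypothetical stand-ins for the two helpers used by findid. */
int isvalid(struct obj *item)
{
    return item != 0 && item->id > 0;
}

int internal_findid(struct obj *item)
{
    return item->id;
}

/* Same function as in the example above. */
int findid(struct obj *item, int *errorptr)
{
    if (isvalid(item) == 0) {
        *errorptr = ERR_INVALID;
        return 0;
    }
    *errorptr = ERR_NOERROR;
    return internal_findid(item);
}

/* Hypothetical caller: err is a local variable of the caller,
 * and findid writes its second return value into it. */
int lookup(struct obj *item)
{
    int err;
    int id = findid(item, &err);

    if (err != ERR_NOERROR)
        return -1;
    return id;
}
```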

You can define functions with a variable number of arguments, called "variadic functions". The most common example is printf and its variants. Defining variadic functions is not trivial, and is not discussed here.

Calling a variadic function is pretty common, and you just need to correctly pass all arguments. In the case of the printf family of functions, one of the first arguments is a string that specifies how many arguments are needed and the type of each of them. The variadic function uses the string to know what arguments to expect and fetch them from memory. Since the string format is standardized, the compiler can check all arguments and warn about possible errors if the argument list is inconsistent with the string. For variadic functions that can't be related to printf or scanf or a few other known patterns, the compiler can't check arguments.
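As a sketch of a variadic call, the function below uses snprintf, the member of the printf family that writes into a buffer instead of the terminal. The name format_pair is mine.

```c
#include <stdio.h>

/* Hypothetical helper: writes "name=value" into buf and returns
 * the number of characters that the full output requires. */
int format_pair(char *buf, size_t size, const char *name, int value)
{
    /* The format string announces two more arguments:
     * %s wants a string and %d wants an int, in this order. */
    return snprintf(buf, size, "%s=%d", name, value);

    /* Swapping them, as in snprintf(buf, size, "%s=%d", value, name),
     * is the kind of mismatch that gcc -Wall warns about. */
}
```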

Polymorphic functions don't exist in C: every function name can exist only once in each program, and every function call must pass the same number and type of arguments -- with the exception of variadic functions, where you can pass an arbitrary number of arguments after the initial ones, including no additional argument at all.

The preprocessor

As already noted, the preprocessor is a program that filters source files before the compiler proper can see them.

By including headers you can access function prototypes, data structure declarations and global variables that are defined elsewhere. Normally the documentation of a library function also states which header you need to include to pass relevant information to the compiler.

With #define you can define constants, like in the earlier example, or macros that receive arguments. In any case the substitution is merely typographic, which makes it easy to run into errors like the following:

#define square(a)  a*a
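To see what goes wrong, pass an expression rather than a plain number. In the sketch below the macro is renamed square_bad so that it can sit next to a parenthesized version; both names, and the two wrapper functions, are mine.

```c
/* The macro from the text, under another name. */
#define square_bad(a)   a*a

/* The usual remedy: parentheses around each use of the
 * argument and around the whole body. */
#define square_ok(a)    ((a)*(a))

int bad_result(void)
{
    return square_bad(1 + 2);   /* expands to 1 + 2*1 + 2 */
}

int good_result(void)
{
    return square_ok(1 + 2);    /* expands to ((1 + 2)*(1 + 2)) */
}
```

The first function returns 5 and the second returns 9. Even the parenthesized macro writes its argument twice in the output, so an argument with side effects, like i++, is still a mistake.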

The "#ifdef X" - "#else" - "#endif" form only checks whether the symbol X is defined (in the form of #define) or not. The "#if expr" - "#else" - "#endif" form evaluates a constant integer expression, so the value must be known at compilation time. In #if you can refer to numbers, symbols defined earlier and integer operators, but you can also use "defined(X)". To avoid too many conditional levels and too many #endif you can use #elif, which means "else if". That said, please avoid preprocessor conditionals if possible.
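A minimal sketch of the conditional forms, assuming a hypothetical DEBUG_LEVEL symbol that would normally come from the build system rather than from the source itself:

```c
#define DEBUG_LEVEL 2   /* hypothetical symbol, defined here for the example */

/* #if evaluates a constant integer expression. */
#if DEBUG_LEVEL >= 2
#define LOG_NAME "verbose"
#elif DEBUG_LEVEL == 1
#define LOG_NAME "normal"
#else
#define LOG_NAME "quiet"
#endif

/* #ifdef only checks whether the symbol is defined at all. */
#ifdef DEBUG_LEVEL
#define HAVE_DEBUG 1
#else
#define HAVE_DEBUG 0
#endif

/* The compiler only sees the text that survived: return "verbose"; */
const char *log_name(void)
{
    return LOG_NAME;
}
```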

Libraries

Libraries contain code and data, i.e. global functions and global variables. Many functions used by C programs have standardized names, and the filenames for headers you include are standardized as well. So we include <stdio.h> to work with files, we include <string.h> if we need to call string functions (to compare, find the length, extract substrings and so on), and there are many more standard headers.

We are not interested, within the topic of real-time systems, in getting acquainted with the many functions you find in the standard library. We just need to know that all of these global functions and variables (like stdin and stdout) are part of the «C library», which the compiler automatically uses when it needs to resolve undefined symbols in the source file. The compiler might also refer to another library (for example libgcc), that contains procedures called by the compiler within the generated object code; this library is automatically included like the standard one, during the final compilation steps.

If you need additional libraries, like libjpeg, your source must include the relevant headers. Please note that such headers are not the libraries. The structure struct jpeg_compress_struct is declared in <jpeglib.h>, but the function jpeg_start_compress is made up of machine code that lives in another file, the library, that is used by the linker, not by the preprocessor.

The linker

The compiler takes your C code, asks the preprocessor to make typographic modifications to it, and then translates the result to assembly code. It then passes the output to the assembler, which converts it to object files that contain machine code.

Such object files include code and data, together with lists of undefined symbols. A symbol, at this level, is just a name that needs to be matched to a memory address. The final resolution of undefined symbols is performed by a program called "linker", whose executable file is called «ld» ("LoaDer").

Therefore, some compilation errors are reported by the linker and not by the compiler proper: typically this happens with "undefined symbol" errors when the C source contains a mistyped function name. According to how much symbolic information is part of the object file, the error message can state the exact line number in the source file or have no reference to source code.

Usually you let the preprocessor and assembler work with default settings, but sometimes you may need to directly control the actions of the linker, for example to specify which libraries it must look up to resolve undefined symbols. This is not needed for simple programs.

When you build programs that are not hosted within an operating system (uC firmware, kernel, boot-loader), the linker must be instructed to not use the standard library during symbol resolution.

To probe further

What is shown here summarizes the most important features of the C language and should be enough to read well-written code without feeling completely lost.

There are, nonetheless, some additional points that I feel are important, so I split them in another document: A-C-X-more-en.html.

External references

As a book, if you want to get one, I suggest the Kernighan and Ritchie, which is a very good text. The others are usually horrible, and I don't know anything of intermediate quality.

From A to X passing through C: an extract of C language.