paritybit.ca

Some Notes on Program Style and Composition

This document describes my guidelines and philosophies for programming style and composition. Even in languages which enforce a certain style, it’s still easy to make many mistakes which can lead to programs being hard to read and less maintainable.

Keep in mind that no programming style should be treated as “high gospel”. It is ridiculous to get into flame wars over where to place braces and what kind of indentation to use. There are demonstrable pros and cons to different styles and people will choose what they like best. As long as it doesn’t produce objectively ugly, hard-to-read code, it is acceptable enough.

My overarching philosophy is that programmers should write programs for humans to read, not computers to execute. Write your code like you’re trying to explain to your fellow programmers (which includes future you) exactly how the program works and what it is doing. Code that is “cleverly” written is often code that is poorly written (unless you’re doing it for a competition or for fun).

Where possible, use common sense to make your program as readable as it can be. When working on a codebase that is not your own, follow the existing style (if there is one). If a programming language has a standardized style (e.g. Go’s gofmt or Python’s PEP 8), abide by that style; just use the standard formatter and get on with your life.

Dependencies

Limit the use of external dependencies as much as possible. In order of preference:

  1. Write bespoke libraries for your purpose
  2. Clone an existing library and merge it into your codebase so that you maintain it alongside your code (vendor it)
  3. Depend on an external library

While it is reasonable to expect some amount of dependency on external things (unless you write your own computing stack from the ground-up), limiting that as much as possible ensures that you are less likely to be hit by things like breakages due to an incompetent or malicious open source maintainer or breakages caused by an unexpected update to something you’re using. It gives your programs the best chance of being usable far into the future instead of rotting within a few months or years. Not to mention that it’s going to be far better for you as a programmer to know how to actually program things instead of just knowing how to glue together different libraries under the framework du jour.

The trend in JavaScript (npm), Python (pip), Rust (cargo), and Go (its de facto dependencies on Google and GitHub, though it is in a much better position than the others) is very concerning for the future of programs written in these languages. In fact, I have run into issues getting older Python programs to run because packages were no longer available, and it’s hard to compile a new version of a Go program for the first time when there’s a GitHub outage. Implementation diversity and distributed sources lend themselves to a far healthier ecosystem.

Complexity and Optimization

Try to minimize complexity as much as possible. Complexity does not necessarily refer to lines of code. Complexity often arises from complicated program logic, bad program design, “cleverly-written” or “clean™” code, and excessive reliance on external dependencies, where lines of code is just a symptom of these diseases.

The more code a program has, the more bugs your program can have. The more features your code has, the more ways those features can interact in unexpected ways and result in bugs. The more cleverly a program is written, the harder it is to spot bugs or debug.

Also, simple algorithms and data structures are often preferable over fancy ones. Fancier algorithms and data structures offer more opportunity for sneaky bugs and are also harder to debug than their simpler counterparts. Implement them only when you know you need them (see the paragraph below).

Finally, this is one point that I wish was hammered into every programmer’s head: measure the performance characteristics of your program; don’t assume them. Too many programs are written in convoluted or confusing ways in an attempt to chase perceived performance gains, either without any actual benchmarking or as a result of useless micro-benchmarks.

Only tune once you understand how your actual, real-world program behaves because only then can you justify the increased complexity that comes along with bespoke optimizations and only then will you know what is worth optimizing and what is not. There is no point making a program harder to read by applying optimizations that lead to extremely marginal improvements.

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, [they] will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgements about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail.

Donald Knuth - Structured Programming with go to Statements
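
As a minimal sketch of what “measure” can mean in practice, the following C times a piece of code with the standard clock() function before anyone decides whether it needs optimizing. The workload sum_below() and the helper measure_seconds() are hypothetical names made up for this illustration; a real program would more likely be measured with a profiler under a realistic workload.

```c
#include <time.h>

/* Hypothetical workload: sum the integers below n the straightforward way. */
unsigned long
sum_below(unsigned long n)
{
    unsigned long total;
    unsigned long i;

    total = 0;
    for (i = 0; i < n; i++)
    {
        total += i;
    }
    return total;
}

/* Return the CPU time, in seconds, used by reps calls of sum_below(n). */
double
measure_seconds(unsigned long n, int reps)
{
    /* volatile: the results must actually be stored, so the calls
     * are not dropped as dead code. */
    volatile unsigned long sink;
    clock_t start;
    clock_t end;
    int r;

    sink = 0;
    start = clock();
    for (r = 0; r < reps; r++)
    {
        sink += sum_below(n);
    }
    end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling something like measure_seconds(1000000, 100) before and after a change gives a number to compare; without such a number, any claimed speed-up is a guess.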

Comments

Comments should exist in code to express why something is being done a certain way or to clarify a particularly tricky bit of code that is hard to express in any other way. Comments like the following are completely useless and should not exist:

struct Node *np = &node;     /* Create a pointer to the node */
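
By contrast, here is a sketch of a comment that records why the code is written the way it is. The function find_sorted() is a hypothetical binary search invented for this illustration; the point is only the comment on the midpoint calculation, which explains a choice a reader could not recover from the code alone.

```c
/* Hypothetical helper: return the index of key in a sorted array, or -1. */
int
find_sorted(const int *values, int count, int key)
{
    int low;
    int high;
    int mid;

    low = 0;
    high = count - 1;
    while (low <= high)
    {
        /* Not (low + high) / 2: that sum can overflow int on huge arrays. */
        mid = low + (high - low) / 2;
        if (values[mid] == key)
        {
            return mid;
        }
        if (values[mid] < key)
        {
            low = mid + 1;
        }
        else
        {
            high = mid - 1;
        }
    }
    return -1;
}
```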

Comments should also not be annoyingly large banners, and big blocks of comments should be avoided. Definitely don’t put a copy of your code’s license at the top of each file; a simple, single-line statement referring to the LICENSE file is often sufficient.

If you use the “doxygen-style” of commenting before a function and at the top of a file, you must treat this as writing high-quality documentation. Otherwise it ends up being as useless as the above example in practice. I have often seen these kinds of comments used as an excuse to not write good documentation (“Look, we have doxygen-generated docs that list all the parameters; we documented this code!”) which makes these kinds of comments just useless clutter.

Great uses of comments include:

Also see: The Misunderstood Concepts of Code Comments and Comments on Comments.

Conditions

It is often preferable to be explicit when checking conditions. For example:

if (x == 1) { ... }

if (ptr == NULL) { ... }

while (*s != '\0') { ... }

is clearer than:

if (x) { ... }

if (!ptr) { ... }

while (*s) { ... }

Naming

Names should be descriptive and clear in context, but not redundant or excessive. In general, procedures should be named based on what they do and functions should be named based on what they return. Variables and types should be nouns (e.g. num_cakes or Parser). For example:

screen_changer();
do();

len = length_computer(x);
if (check_length(len))

is worse than:

draw();
blink();

len = compute_length(x);
if (valid_length(len))

Types, classes, structs or other similar programming structures should be written in capitalised CamelCase, variables and functions in snake_case, and constants in ALL_CAPS. If the components of a variable name are short, there are a maximum of two, and they are distinct, it is often fine to just write the name in all lowercase as opposed to snake case. For example, maxval over max_val, but not percenttrue over percent_true. Write what is clearest.

Also, if a variable represents a quantity in specific units (e.g. milliseconds or kilometers), this should be reflected in its name. This is also why snake_case is preferable to camelCase for variables. timeMs, timeMS, timeMilliseconds or mixing cases like radioRange_km all look worse than time_ms or radio_range_km.

Another reason why snake_case is preferable is that it eliminates the need to decide whether acronyms should be capitalized. Should you write fetchRssFeedAsXml or fetchRSSFeedAsXML? Eliminate the need to make that decision by just writing fetch_rss_feed_as_xml. Whatever you do, please don’t mix the two (XMLHttpRequest is a classic real-world example, probably done because nothing else looked better, but xml_http_request is perfectly clear and doesn’t make capital letters play double-duty as word boundary markers and as abbreviations).

Furthermore, it is acceptable to use short names where the meaning is immediately clear from the context. For example, np is better than node_pointer or nodepointer as long as you are consistent and it is obvious what the variable refers to in a given context (e.g. you created it like: struct Node *np = &node; or it has a very short lifetime).

Single-letter names are acceptable only in contexts in which they are obvious or well-understood convention/notation (e.g. x and y in mathematics, i and j for loop iterator variables).

Finally, if two variables are related in some way, they should have consistent naming. For example, if you have variables representing a maximum and a minimum word count, choose names like max_word_count and min_word_count, not max_word_count and words_required.

In summary…

Worse

time = 5000
distance = 5
maximumValueUntilOverflow = 65535

ComputeResult(initialValue, newValue, modifier)

FileName = "file.txt"
coordinate_pair updatedcoordinate = (ComputeX(x), y)

class SetupTeardownIncluder()

Better

maxval = 65535
time_ms = 5000
distance_km = 5

compute_result(initial_value, new_value, modifier)

filename = "file.txt"
CoordinatePair updated_coordinate = (compute_x(x), y)

class PageBuilder()

Braces and Parentheses

Braces and parentheses should be used liberally to make code easy to follow.

Although many have unfortunately settled on the following brace style for functions, control statements, and the like:

if (condition) {
    do_some_stuff();
    and_some_more();
}

it is, in my opinion, far more readable in general to put opening braces on a new line (Allman style):

if (condition)
{
    do_some_stuff();
    and_some_more();
}

While this might appear excessive for such a trivial statement, the readability advantages quickly become clear when blocks get longer or more complex.

The additional vertical whitespace afforded by the opening brace being on a new line helps to visually separate distinct blocks of code, and it is much easier to find matching braces and keep track of block boundaries since the opening and closing braces visually line up.

It also means that, should the conditions in an if statement or the arguments to a function grow too long to fit comfortably on one line, there is still clear separation between the statement and its body without needing to awkwardly double-indent the set of conditions or arguments.

For example:

if (is_logged_in(client)
        && client->assignedAddress
        && strncmp(client->username, "admin", sizeof("admin")) == 0
        && authenticate(client->password, password)) {
    render_admin_panel();
}

is not as nice as:

if (is_logged_in(client)
    && client->assignedAddress
    && strncmp(client->username, "admin", sizeof("admin")) == 0
    && authenticate(client->password, password))
{
    render_admin_panel();
}

Similarly:

for (s = opts; (p = strsep(&s, ",")) != NULL;) {
    /* always leave space for one more argument and the NULL */
    if (argc >= maxargc - 3) {
        int newmaxargc = maxargc + 50;

        argv = ereallocarray(argv, newmaxargc, sizeof(char *));
        maxargc = newmaxargc;
    }
    if (*p != '\0') {
        if (*p == '-') {
            argv[argc++] = p;
            p = strchr(p, '=');
            if (p) {
                *p = '\0';
                argv[argc++] = p+1;
            }
        }
        else {
            argv[argc++] = "-o";
            argv[argc++] = p;
        }
    }
}

The above code is from fsck.c in the OpenBSD codebase and is BSD-3-Clause-licensed.

is not as nice as:

for (s = opts; (p = strsep(&s, ",")) != NULL;)
{
    /* always leave space for one more argument and the NULL */
    if (argc >= maxargc - 3)
    {
        int newmaxargc = maxargc + 50;

        argv = ereallocarray(argv, newmaxargc, sizeof(char *));
        maxargc = newmaxargc;
    }
    if (*p != '\0')
    {
        if (*p == '-')
        {
            argv[argc++] = p;
            p = strchr(p, '=');
            if (p)
            {
                *p = '\0';
                argv[argc++] = p+1;
            }
        }
        else
        {
            argv[argc++] = "-o";
            argv[argc++] = p;
        }
    }
}

This also means that subsequent control statements should go on their own line like so:

if (condition)
{
    do();
    things();
}
else
{
    other();
    stuff();
}

instead of:

if (condition)
{
    do();
    things();
} else
{
    other();
    stuff();
}

because it is much easier to select the else statement (for deletion, copying, etc.) if it is not on the same line as the closing brace of the if statement. I find it easier to read this way too.

Also, please never put complex single-line bodies on the same line as control statements. It’s much harder to read. Don’t do this:

for (i = 0; i < 10; i++) for (j = 0; j < 10; j++) printf("(%d,%d)\n", i, j);

Also, in general, single-line control blocks like so:

if (condition)
    do_stuff();

should have surrounding braces. Although it can seem excessive, especially paired with my other preferences, it eliminates a class of errors that arises when code is added to a control block, or a line is commented out, but you forget to also add the braces:

if (condition)
    do_stuff();
    do_more_stuff(); /* <- This is not in the if block! */

if (condition)
    /* do_stuff(); */ /* <- I just commented this... */

do_other_stuff(); /* <- ... so now this is in the if block! */

(Although whitespace-indented languages such as Python do not have to deal with this issue, that brings along other issues with code readability due to blocks not being as clearly delineated.)

Plus, I find it more ergonomic to be able to quickly comment out or add statements without also then having to add or remove braces to keep the style consistent or the code correct. I also find it easier to read when control statements are next to each other such as in:

if (condition)
{
    continue;
}
if (other_condition)
{
    printf("Other condition reached!\n");
    do_something();
}

compared to:

if (condition)
    continue;
if (other_condition)
{
    printf("Other condition reached!\n");
    do_something();
}

Line Length

Hard limits on line length are outmoded and poorly reasoned. Unless you are literally programming with punched cards or using an extremely limited display (e.g. you’re programming for a retro computer), there is probably no good reason to set a hard limit on the length of lines.

However, you should strive to keep lines within a reasonable length, so you and other programmers who work with your code do not have to scroll a reasonably-sized editor to see the ends of a line. Limiting the length of your lines to something reasonable also allows you to have multiple editor windows or panes side-by-side without needing to scroll each of them.

Additionally, shorter line lengths are easier for people to read since long lines make it harder to find the beginning of the next line. While this mostly applies to prose, it can also apply to code in things like code comments or dense blocks of code that are awkward to split up.

So avoid excessively long lines, but don’t stress about a line that is 81 or even 90 characters long. Use common sense. If your code is more readable wrapped to smaller lines, then do that. If wrapping would make your statement look awkward and it’s not too long, then don’t.

However, it’s still worth it to set up your editor to display a line at 80 columns because this can help you notice when you’re indenting excessively or writing something convoluted.

Indentation, Spacing and Alignment

Before touching on the hot-button topic, a few notes about other aspects:

In general, if you find your code being heavily nested, this should be a sign to refactor or reconsider your approach. An oft-repeated adage (originating from Linus Torvalds) is that “if you need more than 3 levels of indentation, you’re screwed anyway, and should fix your program.” While this is not a hard-and-fast rule, especially for programming languages which have functions inside of class definitions and so on, it is a decent guideline for writing the logic of your code. An excessive level of indentation probably indicates inefficient use of conditionals or heavily nested loops, which tend to be quite slow.
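
One common way to reduce nesting, sketched here under my own assumptions rather than taken from any particular codebase, is to reject uninteresting cases early with guard clauses so that the main logic stays at a single level. The function count_flags() and its arguments are hypothetical.

```c
#include <stddef.h>

/* Hypothetical example: count the entries that start with '-'. */
int
count_flags(const char **args, int count)
{
    int flags;
    int i;

    flags = 0;
    for (i = 0; i < count; i++)
    {
        /* Guard clauses: skip uninteresting entries early instead of
         * wrapping the rest of the loop body in another if block. */
        if (args[i] == NULL)
        {
            continue;
        }
        if (args[i][0] != '-')
        {
            continue;
        }
        flags++;
    }
    return flags;
}
```

The equivalent version with nested if blocks would put flags++ three levels deep; here it never goes beyond one level inside the loop.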

Regarding spacing in a statement, there should be spaces between binary or ternary operators (but not unary operators) and their operands as well as between elements in a list:

Worse

array=[1,2,3,4];
int i;
for(i=0;i<10;++i){
    printf("%d",i*i+i);
}

Better

array = [1, 2, 3, 4];
int i;
for (i = 0; i < 10; ++i) {
    printf("%d", i * i + i);
}

Also, it is acceptable to use spaces to visually align your code if it makes it easier to read (but do not use tabs for this; their variable width makes them inconsistent for alignment). Once again, use your judgement. Oftentimes no alignment is really needed and it makes things look worse for those who like to program with a variable-width font.

Now, onto The Debate™…

I think there is little real difference between spaces and tabs for indentation.

That being said, I prefer tabs wherever supported for the following reasons in order of importance:

On the first point, if a programmer prefers to see a lot of whitespace to indicate levels of indentation (as I do), they can choose to leave tabs at their default width of 8 columns. If a programmer likes their code to be a little more compact, they can choose to display the width of a tab at 4 or even 2 columns. Using tabs separates the representation of indentation from the actual content of the file and allows a programmer to view the file based on their preferences.

On the second point, although disk space is quite plentiful today, bandwidth still isn’t. Even on files that are only a couple of hundred lines long, indenting with spaces instead of tabs increases file size by a non-trivial amount. If you have hundreds of source files across a whole project, this effect is even more noticeable. Likewise if you’re trying to optimize the size of assets you’re delivering over the wire (e.g. HTML, CSS, and JS files).

For example, this page as an HTML file indented with four spaces takes up roughly 33 kilobytes, but it only takes up 23 kilobytes if indented with tabs. This quickly compounds the more data is sent over the wire and affects everything from serving websites to cloning code repositories. I don’t like being wasteful with computing resources whenever I can help it, so this matters to me.

Regardless of any other points, nobody should be using mixed indentation or two spaces for indentation. Mixed indentation makes it difficult to follow blocks of code, and two-space indentation is harder to read than four-space indentation because there is less whitespace to differentiate between indentation levels. It can also encourage over-indentation since it takes more nesting for you to run up against that 80-column line. Please don’t do what GNU does.

Given the advantages tabs have over spaces and that the only benefit spaces have over tabs is that they look the same to every programmer (which is not actually good or important given my points about programmer preference and line length), it is easy to see why tabs are a natural choice for indentation.

Language Specifics

Different programming languages have different conventions or best-practices that are dependent on the syntax and usage of that language. Here are some notes on various languages:

I am still expanding this section

C

C functions should be named in snake_case for two reasons: by convention, internal functions are preceded by two underscores, and “namespacing” your functions by prepending a category or type is more readable with underscores.

There is no performance difference between ++i and i++ in most compilers when the value of the expression is not used (unless you tell your compiler to not attempt any optimizations whatsoever).

Prefer enums over #define directives because they are easier to debug.
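
A small sketch of the enum preference, using a hypothetical State type and state_name() function made up for this example. Because the values are real symbols rather than text substituted by the preprocessor, a debugger can usually show a name such as STATE_RUNNING instead of a bare 1, and compilers such as GCC and Clang can warn when a switch over the enum misses a value.

```c
/* Hypothetical set of states, written as an enum rather than as a
 * series of #define STATE_IDLE 0, #define STATE_RUNNING 1, ... */
enum State
{
    STATE_IDLE,
    STATE_RUNNING,
    STATE_DONE
};

/* Return a human-readable name for a state. */
const char *
state_name(enum State state)
{
    switch (state)
    {
        case STATE_IDLE:
            return "idle";
        case STATE_RUNNING:
            return "running";
        case STATE_DONE:
            return "done";
    }
    return "unknown";
}
```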

Avoid macros, and if you do have to use them for performance reasons, they must be as simple as possible so they are not a pain to debug.

Sometimes a “magic number” is nicer to have in the code directly, with an accompanying comment, than in a completely different section of the code as a #define. Choose to do what is more readable and easily understood.

Only typedef structs when they are supposed to be opaque to the user (i.e. they should only interact with the struct through functions, and never access fields directly). Also, separate typedefs from your structs/enums because it’s easier to read and grep for.
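
As an illustration of an opaque struct with a separate typedef, here is a sketch of a hypothetical Counter type. In a real project the first part would sit in a header and the second in a .c file; they are shown together only to keep the example in one piece. The counter_ prefix also shows the kind of “namespacing” mentioned above.

```c
#include <stdlib.h>

/* What would live in a hypothetical counter.h: users only see the name. */
typedef struct Counter Counter;

Counter *counter_new(void);
void counter_increment(Counter *c);
int counter_value(const Counter *c);
void counter_free(Counter *c);

/* What would live in a hypothetical counter.c: the fields stay private. */
struct Counter
{
    int value;
};

Counter *
counter_new(void)
{
    Counter *c;

    c = malloc(sizeof(*c));
    if (c == NULL)
    {
        return NULL;
    }
    c->value = 0;
    return c;
}

void
counter_increment(Counter *c)
{
    c->value++;
}

int
counter_value(const Counter *c)
{
    return c->value;
}

void
counter_free(Counter *c)
{
    free(c);
}
```

Code that only includes the header cannot touch c->value directly, so the struct layout can change later without breaking callers.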

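The return type for functions should be on a separate line so it’s easy to search a codebase for the function implementation: the function name then starts the line at its definition, so a pattern such as ^main( matches the definition rather than the calls. Also, if a function takes no arguments, be explicit about it with the void keyword. Do this: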

int
main(void)
{
    return 0;
}

not this:

int main()
{
    return 0;
}

Declare variables on their own line and definitely do not mix declarations and assignments on the same line. Do:

int x;
int y;

not:

int x = 5, y;

Also, initialize variables near where they will be used, rather than at their declaration.

Avoid manual inlining or other statements that only serve as “hints” to the compiler to do something. Let the compiler handle it.

Never use unsafe versions of functions (e.g. strcpy, sprintf) even when you “know” your data will fit. You never know when someone else (or a future version of you) will come along and make a change to the code that causes the data to no longer fit, and now you have a bug at best, or a vulnerability at worst.
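
For instance, a bounded replacement for sprintf might look like the following sketch. The helper format_greeting() is hypothetical; the technique is simply to pass the buffer size to the standard snprintf function and to treat truncation as an error instead of silently accepting it.

```c
#include <stdio.h>

/* Hypothetical helper: format "Hello, <name>!" into buf.
 * Returns 0 on success, -1 if the result did not fit (or on error). */
int
format_greeting(char *buf, size_t size, const char *name)
{
    int written;

    /* snprintf never writes more than size bytes, terminator included. */
    written = snprintf(buf, size, "Hello, %s!", name);
    if (written < 0 || (size_t)written >= size)
    {
        return -1;
    }
    return 0;
}
```

If someone later passes a longer name or shrinks the buffer, the function reports failure rather than writing past the end of buf.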

Don’t use alloca or allocate large arrays/structs on the stack. Prefer malloc/free.

Return values should generally be -1 or NULL for errors, 0 for success, and >0 for any other non-error state/value your function/program wishes to communicate. Make use of standard errno, perror and other such tools to set or find out exactly what went wrong.
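
A sketch of this convention, assuming a hypothetical parse_port() helper: it returns 0 on success and -1 on failure, and sets errno so the caller can find out what went wrong. It relies on the standard strtol function and on the EINVAL and ERANGE error codes (EINVAL comes from POSIX rather than ISO C).

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a base-10 port number from text into *port.
 * Returns 0 on success; returns -1 and sets errno on failure. */
int
parse_port(const char *text, int *port)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0)
    {
        return -1; /* strtol already set errno (e.g. ERANGE) */
    }
    if (end == text || *end != '\0')
    {
        errno = EINVAL; /* no digits at all, or trailing junk */
        return -1;
    }
    if (value < 1 || value > 65535)
    {
        errno = ERANGE; /* a number, but not a valid port */
        return -1;
    }
    *port = (int)value;
    return 0;
}
```

A caller can then write if (parse_port(arg, &port) == -1) and pass a short message to perror to print the reason.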

Prefer the /* ... */ style of comments. These are trivially easy to extend into multi-line comments and I personally think they look nicer than C++ style // ... comments. Also, when writing a multi-line comment, write them like so:

/* This is a multiline comment.
 * A comment with multiple lines.
 * Many lines, such wow. */

or:

/*
 * This is a multiline comment.
 * A comment with multiple lines.
 * Many lines, such wow.
 */

Just don’t mix the two.

For switch/case statements, use “fallthrough” comments unless several case statements follow the same branch. For example:

switch (...)
{
    case 1:
    case 2:
    case 3:
        /* ... */
        /* fallthrough */
    case 4:
        /* ... */
        break;
    default:
        /* ... */
        break;
}

Don’t #include .c files and don’t use #include in header files.

Make liberal use of valgrind and other dynamic analysis, profiling, and static checking tools to catch obvious mistakes.

Shell Scripting

Prefer do, then, etc. on the same line as the control statement. Since there are no braces to demarcate code blocks in shell scripts the same way there are in many programming languages, the pair of if/fi, for/done, and so on make up the visual marker of a block of code. For example:

for i in $(seq 10); do
    echo "$i"
done

if [ "$var" = "value" ]; then
    echo "var is value"
fi

Use quotes around variables wherever possible. This prevents accidental incorrect behaviour where a variable expands to a value that contains spaces or special characters that then get interpreted by the shell as being additional arguments to a program or other shell syntax.

In general, write POSIX-compliant shell programs that use portable program flags over ones that are OS or shell-specific. If non-portable flags or programs must be used on one system but not another, check for OS versions using uname -s and set variables like $cmd and $cmd_flags which can be used later on in the script to call the right program with the right flags.

Python