Some Notes on Program Style and Composition
This document describes my guidelines and philosophies for programming style and composition. Even in languages which enforce a certain style, it’s still easy to make many mistakes which can lead to programs being hard to read and less maintainable.
Keep in mind that no programming style should be treated as “high gospel”. It is ridiculous to get into flame wars over where to place braces and what kind of indentation to use. There are demonstrable pros and cons to different styles and people will choose what they like best. As long as it doesn’t produce objectively ugly, hard-to-read code, it is acceptable enough.
My overarching philosophy is that programmers should write programs for humans to read, not computers to execute. Write your code like you’re trying to explain to your fellow programmers (which includes future you) exactly how the program works and what it is doing. Code that is “cleverly” written is often code that is poorly written (unless you’re doing it for a competition or for fun).
Where possible, use common sense to make your program as readable as it can be. When working on a codebase that is not your own, follow the existing style (if there is one). If a programming language has a standardized style (e.g. Go’s gofmt or Python’s PEP 8), abide by that style; just use the standard formatter and get on with your life.
- Dependencies
- Complexity and Optimization
- Comments
- Conditions
- Naming
- Braces and Parentheses
- Line Length
- Indentation, Spacing and Alignment
- Language Specifics
Also see:
- Clean Coders Hate What Happens to Your Code When You Use These Enterprise Programming Tricks
- “Clean” Code, Horrible Performance
Dependencies
Limit the use of external dependencies as much as possible. In order of preference:
- Write bespoke libraries for your purpose
- Clone an existing library and merge it into your codebase so that you maintain it alongside your code (vendor it)
- Depend on an external library
While it is reasonable to expect some amount of dependency on external things (unless you write your own computing stack from the ground-up), limiting that as much as possible ensures that you are less likely to be hit by things like breakages due to an incompetent or malicious open source maintainer or breakages caused by an unexpected update to something you’re using. It gives your programs the best chance of being usable far into the future instead of rotting within a few months or years. Not to mention that it’s going to be far better for you as a programmer to know how to actually program things instead of just knowing how to glue together different libraries under the framework du jour.
The trend in JavaScript (npm), Python (pip), Rust (cargo), and Go (its de-facto dependencies on Google and GitHub, though it is in a much better position than the others) is very concerning for the future of programs written in these languages. In fact, I have run into issues getting older Python programs to run because packages were no longer available, and it’s hard to compile a new version of a Go program for the first time when there’s a GitHub outage. Implementation diversity and distributed sources lend themselves to a far healthier ecosystem.
Complexity and Optimization
Try to minimize complexity as much as possible. Complexity does not necessarily refer to lines of code. Complexity often arises from complicated program logic, bad program design, “cleverly-written” or “clean™” code, and excessive reliance on external dependencies, where lines of code is just a symptom of these diseases.
The more code a program has, the more bugs your program can have. The more features your code has, the more ways those features can interact in unexpected ways and result in bugs. The more cleverly a program is written, the harder it is to spot bugs or debug.
Also, simple algorithms and data structures are often preferable over fancy ones. Fancier algorithms and data structures offer more opportunity for sneaky bugs and are also harder to debug than their simpler counterparts. Implement them only when you know you need them (see the paragraph below).
Finally, this is one point that I wish was hammered into every programmer’s head: measure, don’t assume the performance characteristics of your program. Too many programs are written in convoluted or confusing ways in an attempt to chase perceived performance gains without any actual benchmarking or as a result of useless micro-benchmarks.
Only tune once you understand how your actual, real-world program behaves because only then can you justify the increased complexity that comes along with bespoke optimizations and only then will you know what is worth optimizing and what is not. There is no point making a program harder to read by applying optimizations that lead to extremely marginal improvements.
There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, [they] will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgements about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail.
Donald Knuth - Structured Programming with go to Statements (1974)
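As a minimal sketch of what "measuring" can look like in C, the following times a function with the standard clock() call. The names elapsed_ms, sum_to and time_call are hypothetical and only for illustration; a real measurement should run your actual program on realistic input, ideally under a proper profiler.

```c
#include <time.h>

/* Convert a pair of clock() readings into elapsed CPU milliseconds. */
double
elapsed_ms(clock_t start, clock_t end)
{
    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC;
}

/* Hypothetical workload: sum the integers from 1 to n. */
unsigned long
sum_to(unsigned long n)
{
    unsigned long i;
    unsigned long total;

    total = 0;
    for (i = 1; i <= n; i++)
    {
        total += i;
    }
    return total;
}

/* Call fn(arg) `runs` times (runs > 0); return average CPU ms per call. */
double
time_call(unsigned long (*fn)(unsigned long), unsigned long arg, int runs)
{
    volatile unsigned long sink;
    clock_t start;
    clock_t end;
    int i;

    sink = 0;
    start = clock();
    for (i = 0; i < runs; i++)
    {
        /* storing into a volatile keeps the result from being discarded */
        sink = fn(arg);
    }
    end = clock();
    (void)sink;
    return elapsed_ms(start, end) / runs;
}
```

Calling time_call(sum_to, 1000000, 100) before and after a change gives a number to compare, rather than a guess.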
Comments
Comments should exist in code to express why something is being done a certain way or to clarify a particularly tricky bit of code that is hard to express in any other way. Comments like the following are completely useless and should not exist:
struct Node *np = &node; /* Create a pointer to the node */
Comments should also not be annoyingly large banners, and big blocks of comments should be avoided. Definitely don’t put a copy of your code’s license at the top of each file; a simple, single-line statement referring to the LICENSE file is often sufficient.
If you use the “doxygen-style” of commenting before a function and at the top of a file, you must treat this as writing high-quality documentation. Otherwise it ends up being as useless as the above example in practice. I have often seen these kinds of comments used as an excuse to not write good documentation (“Look, we have doxygen-generated docs that list all the parameters; we documented this code!”) which makes these kinds of comments just useless clutter.
Great uses of comments include:
- Pointing to external documentation, a specification, or a guide that explains what is going on in the code
- TODO or FIXME markers
- Explaining a particularly complex data structure or algorithm
- Explaining the purpose of a function or code block that isn’t immediately clear
- Explaining the quirks of a particular algorithm (e.g. working around a hardware limitation)
- Explaining critical decisions (why one algorithm was chosen over another, system requirements, etc.)
Also see: The Misunderstood Concepts of Code Comments and Comments on Comments.
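By way of contrast with the useless comment above, here is a small sketch of a comment that explains why the code is written the way it is. The function name midpoint and its precondition are assumptions made for this example.

```c
/* Return the midpoint of the range [lo, hi], assuming 0 <= lo <= hi. */
int
midpoint(int lo, int hi)
{
    /*
     * Written as lo + (hi - lo) / 2 rather than (lo + hi) / 2 because
     * lo + hi can overflow an int when both values are large.
     */
    return lo + (hi - lo) / 2;
}
```

The comment says nothing about what the arithmetic does (the code already says that); it records the reason the obvious formula was not used.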
Conditions
It is often preferable to be explicit when checking conditions. For example:
if (x != 0) { ... }
if (ptr == NULL) { ... }
while (*s != '\0') { ... }
is clearer than:
if (x) { ... }
if (!ptr) { ... }
while (*s) { ... }
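A minimal sketch of the explicit style applied to a complete function. The name string_length is hypothetical, and treating a NULL pointer as an empty string is a choice made for this example only.

```c
#include <stddef.h>

/* Count the characters in s before the terminator; NULL counts as empty. */
size_t
string_length(const char *s)
{
    size_t len;

    if (s == NULL)
    {
        return 0;
    }
    len = 0;
    while (s[len] != '\0')
    {
        len++;
    }
    return len;
}
```

Each comparison states what kind of value is being tested (a pointer against NULL, a character against the terminator), so the reader does not have to recall the type of s to follow the logic.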
Naming
Names should be descriptive and clear in context, but not redundant or
excessive. In general, procedures should be named based on what they do and
functions should be named based on what they return. Variables and types should
be nouns (e.g. num_cakes or Parser). For example:
screen_changer();
do();
len = length_computer(x);
if (check_length(len))
is worse than:
draw();
blink();
len = compute_length(x);
if (valid_length(len))
Types, classes, structs or other similar programming structures should
be written in capitalised CamelCase, variables and functions in
snake_case, and constants in ALL_CAPS. If the components of
a variable name are short, there are a maximum of two, and they are
distinct, it is often fine to just write the name in all lowercase as
opposed to snake case. For example, maxval over max_val, but not
percenttrue over percent_true. Write what is clearest.
Also, if a variable represents a quantity in a specific unit (e.g.
milliseconds or kilometers), this should be reflected in its name. This
is also why snake_case is preferable to camelCase for variables.
timeMs, timeMS, timeMilliseconds or mixing cases like
radioRange_km all look worse than time_ms or radio_range_km.
Another reason why snake_case is preferable is that it eliminates the
need to decide whether acronyms should be capitalized. Should you write
fetchRssFeedAsXml or fetchRSSFeedAsXML? Eliminate the need to make
that decision by just writing fetch_rss_feed_as_xml. Whatever you do,
please don’t mix the two (XMLHttpRequest is a classic real-world
example, probably done because nothing else looked better, but
xml_http_request is perfectly clear and doesn’t make capital letters
play double-duty as word boundary markers and as abbreviations).
Furthermore, it is acceptable to use short names where the meaning is
immediately clear from the context. For example, np is better than
node_pointer or nodepointer as long as you are consistent and it is
obvious what the variable refers to in a given context (e.g. you created
it like: struct Node *np = &node; or it has a very short lifetime).
Single-letter names are acceptable only in contexts in which they are
obvious or well-understood convention/notation (e.g. x and y in
mathematics, i and j for loop iterator variables).
Finally, if two variables are related in some way, they should have
consistent naming. For example, if you have variables representing
a maximum and a minimum word count, choose names like max_word_count and
min_word_count, not max_word_count and words_required.
In summary…
Worse
time = 5000
distance = 5
maximumValueUntilOverflow = 65535
ComputeResult(initialValue, newValue, modifier)
FileName = "file.txt"
coordinate_pair updatedcoordinate = (ComputeX(x), y)
class SetupTeardownIncluder()
Better
time_ms = 5000
distance_km = 5
maxval = 65535
compute_result(initial_value, new_value, modifier)
filename = "file.txt"
CoordinatePair updated_coordinate = (compute_x(x), y)
class PageBuilder()
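The conventions above, gathered into one small C sketch. Every name here (RadioLink, MAX_RANGE_KM, valid_range, reset_link) is hypothetical and exists only to show the casing and unit suffixes side by side.

```c
#include <stdbool.h>

/* Constant in ALL_CAPS, with the unit in the name. */
enum { MAX_RANGE_KM = 50 };

/* Type in CamelCase; fields in snake_case, with units where they apply. */
struct RadioLink
{
    int range_km;
    int timeout_ms;
};

/* A function that returns a value is named for what it returns. */
bool
valid_range(const struct RadioLink *link)
{
    return link->range_km > 0 && link->range_km <= MAX_RANGE_KM;
}

/* A procedure is named for what it does. */
void
reset_link(struct RadioLink *link)
{
    link->range_km = 0;
    link->timeout_ms = 0;
}
```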
Braces and Parentheses
Braces and parentheses should be used liberally to make code easy to follow.
Although many have unfortunately settled on the following brace style for functions, control statements, and the like:
if (condition) {
    do_some_stuff();
    and_some_more();
}
it is, in my opinion, far more readable in general to put opening braces on a new line (Allman style):
if (condition)
{
    do_some_stuff();
    and_some_more();
}
While this might appear excessive for such a trivial statement, the readability advantages quickly become clear when blocks get longer or more complex.
The additional vertical whitespace afforded by the opening brace being on a new line helps to visually separate distinct blocks of code, and it is much easier to find matching braces and keep track of block boundaries since the opening and closing braces visually line up.
It also means that, should the conditions in an if statement or the arguments
to a function grow too long to fit comfortably on one line, there is still
clear separation between the statement and its body without needing to
awkwardly double-indent the set of conditions or arguments.
For example:
if (is_logged_in(client)
    && client->assignedAddress
    && strncmp(client->username, "admin", sizeof("admin")) == 0
    && authenticate(client->password, password)) {
    render_admin_panel();
}
is not as nice as:
if (is_logged_in(client)
    && client->assignedAddress
    && strncmp(client->username, "admin", sizeof("admin")) == 0
    && authenticate(client->password, password))
{
    render_admin_panel();
}
Similarly:
for (s = opts; (p = strsep(&s, ",")) != NULL;) {
    /* always leave space for one more argument and the NULL */
    if (argc >= maxargc - 3) {
        int newmaxargc = maxargc + 50;
        argv = ereallocarray(argv, newmaxargc, sizeof(char *));
        maxargc = newmaxargc;
    }
    if (*p != '\0') {
        if (*p == '-') {
            argv[argc++] = p;
            p = strchr(p, '=');
            if (p) {
                *p = '\0';
                argv[argc++] = p+1;
            }
        }
        else {
            argv[argc++] = "-o";
            argv[argc++] = p;
        }
    }
}
(The above code is from fsck.c in the OpenBSD codebase and is BSD-3-Clause-licensed.)
That is not as nice as:
for (s = opts; (p = strsep(&s, ",")) != NULL;)
{
    /* always leave space for one more argument and the NULL */
    if (argc >= maxargc - 3)
    {
        int newmaxargc = maxargc + 50;
        argv = ereallocarray(argv, newmaxargc, sizeof(char *));
        maxargc = newmaxargc;
    }
    if (*p != '\0')
    {
        if (*p == '-')
        {
            argv[argc++] = p;
            p = strchr(p, '=');
            if (p)
            {
                *p = '\0';
                argv[argc++] = p+1;
            }
        }
        else
        {
            argv[argc++] = "-o";
            argv[argc++] = p;
        }
    }
}
This also means that subsequent control statements should go on their own line like so:
if (condition)
{
    do();
    things();
}
else
{
    other();
    stuff();
}
instead of:
if (condition)
{
    do();
    things();
} else
{
    other();
    stuff();
}
because it is much easier to select the else statement (for deletion, copying, etc.) if it is not on the same line as the closing brace of the if statement. I find it easier to read this way too.
Also, please never put complex single-line bodies on the same line as control statements. It’s much harder to read. Don’t do this:
for (i = 0; i < 10; i++) for (j = 0; j < 10; j++) printf("(%d,%d)\n", i, j);
Also, in general, single-line control blocks like so:
if (condition)
    do_stuff();
should have surrounding braces. Although it can seem excessive, especially paired with my other preferences, it eliminates a class of errors that arises when code is added to a control block or a line is commented but you forget to also add the braces:
if (condition)
    do_stuff();
    do_more_stuff(); <- This is not in the if block!

if (condition)
    /* do_stuff(); */ <- I just commented this...
do_other_stuff(); <- ... so now this is in the if block!
(Although whitespace-indented languages such as Python do not have to deal with this issue, that brings along other issues with code readability due to blocks not being as clearly delineated.)
Plus, I find it more ergonomic to be able to quickly comment out or add statements without also then having to add or remove braces to keep the style consistent or the code correct. I also find it easier to read when control statements are next to each other such as in:
if (condition)
{
    continue;
}
if (other_condition)
{
    printf("Other condition reached!\n");
    do_something();
}
compared to:
if (condition)
    continue;
if (other_condition)
{
    printf("Other condition reached!\n");
    do_something();
}
Line Length
Hard limits on line length are outmoded and poorly reasoned. Unless you are literally programming with punched cards or using an extremely limited display (e.g. you’re programming for a retro computer), there is probably no good reason to set a hard limit on the length of lines.
However, you should strive to keep lines within a reasonable length, so you and other programmers who work with your code do not have to scroll a reasonably-sized editor to see the ends of a line. Limiting the length of your lines to something reasonable also allows you to have multiple editor windows or panes side-by-side without needing to scroll each of them.
Additionally, shorter line lengths are easier for people to read since long lines make it harder to find the beginning of the next line. While this mostly applies to prose, it can also apply to code in things like code comments or dense blocks of code that are awkward to split up.
So avoid excessively long lines, but don’t stress about a line that is 81 or even 90 characters long. Use common sense. If your code is more readable wrapped to smaller lines, then do that. If wrapping would make your statement look awkward and it’s not too long, then don’t.
However, it’s still worth it to set up your editor to display a line at 80 columns because this can be a helpful guideline to help you notice when you’re indenting excessively or writing something convoluted.
Indentation, Spacing and Alignment
Before touching on the hot-button topic, a few notes about other aspects:
In general, if you find your code being heavily nested, this should be a sign to refactor or reconsider your approach. An oft-repeated adage (originating from Linus Torvalds) is that “if you need more than 3 levels of indentation, you’re screwed anyway, and should fix your program.” While this is not a hard-and-fast rule, especially for programming languages which have functions inside of class definitions and so on, it is a decent guideline for writing the logic of your code—an excessive level of indentation probably indicates inefficient use of conditionals or heavily-nested loops which tend to be quite slow.
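One common way to reduce nesting is to handle the uninteresting cases first and move on ("guard clauses"). The sketch below shows the same hypothetical function written both ways; the names and the task (counting positive even numbers) are made up for illustration.

```c
#include <stddef.h>

/* Nested version: the real work sits three levels deep. */
int
count_even_positive_nested(const int *vals, size_t n)
{
    size_t i;
    int count;

    count = 0;
    for (i = 0; i < n; i++)
    {
        if (vals[i] > 0)
        {
            if (vals[i] % 2 == 0)
            {
                count++;
            }
        }
    }
    return count;
}

/* Flattened version: guard clauses skip unwanted values early. */
int
count_even_positive(const int *vals, size_t n)
{
    size_t i;
    int count;

    count = 0;
    for (i = 0; i < n; i++)
    {
        if (vals[i] <= 0)
        {
            continue;
        }
        if (vals[i] % 2 != 0)
        {
            continue;
        }
        count++;
    }
    return count;
}
```

Both functions return the same result; the second keeps the main logic at a single level of indentation inside the loop, and adding another condition does not push it further to the right.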
Regarding spacing in a statement, there should be spaces between binary or ternary operators (but not unary operators) and their operands as well as between elements in a list:
Worse
array=[1,2,3,4];
int i;
for(i=0;i<10;++i){
    printf("%d",i*i+i);
}
Better
array = [1, 2, 3, 4];
int i;
for (i = 0; i < 10; ++i) {
    printf("%d", i * i + i);
}
Also, it is acceptable to use spaces to visually align your code if it makes it easier to read (but do not use tabs for this, since their variable width makes them inconsistent for alignment). Once again, use your judgement. Oftentimes no alignment is really needed and it makes things look worse for those who like to program with a variable-width font.
Now, onto The Debate™…
I think there is little real difference between spaces and tabs for indentation.
That being said, I prefer tabs wherever supported for the following reasons in order of importance:
- It is possible to configure how wide a tab appears, so a programmer can choose the width of indentation they prefer
- The tab character is one character, whereas spaces are multiple characters and result in larger files
- Semantically, one tab character equals one level of indentation
On the first point, if a programmer prefers to see a lot of whitespace to indicate levels of indentation (as I do), they can choose to leave tabs at their default width of 8 columns. If a programmer likes their code to be a little more compact, they can choose to display the width of a tab at 4 or even 2 columns. Using tabs separates the representation of indentation from the actual content of the file and allows a programmer to view the file based on their preferences.
On the second point, although disk space is quite plentiful today, bandwidth still isn’t. Even on files that are only a couple of hundred lines long, indenting with spaces instead of tabs increases file size by a non-trivial amount. If you have hundreds of source files across a whole project, this effect is even more noticeable. Likewise if you’re trying to optimize the size of assets you’re delivering over the wire (e.g. HTML, CSS, and JS files).
For example, this page as an HTML file indented with four spaces takes up roughly 33 kilobytes, but it only takes up 23 kilobytes if indented with tabs. This quickly compounds the more data is sent over the wire and affects everything from serving websites to cloning code repositories. I don’t like being wasteful with computing resources whenever I can help it, so this matters to me.
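To put rough numbers on this with a made-up case: a 300-line file averaging two levels of indentation per line spends 300 × 2 × 4 = 2400 bytes on four-space indentation, but only 300 × 2 = 600 bytes with tabs. As a sketch, assuming indentation consists purely of leading spaces, the saving for a given piece of text can be counted like so (bytes_saved_by_tabs is a hypothetical helper, not an existing tool):

```c
#include <stddef.h>

/*
 * Count how many bytes would be saved in text if every full group of
 * `width` leading spaces on a line were replaced by a single tab.
 */
size_t
bytes_saved_by_tabs(const char *text, size_t width)
{
    const char *p;
    size_t saved;
    size_t run;
    int at_line_start;

    if (width == 0)
    {
        return 0;
    }
    saved = 0;
    run = 0;
    at_line_start = 1;
    for (p = text; *p != '\0'; p++)
    {
        if (at_line_start != 0 && *p == ' ')
        {
            run++;
            continue;
        }
        /* The leading run has ended: each full group shrinks to one byte. */
        saved += (run / width) * (width - 1);
        run = 0;
        at_line_start = (*p == '\n');
    }
    /* Account for a final line that consists only of spaces. */
    saved += (run / width) * (width - 1);
    return saved;
}
```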
Regardless of any other points, nobody should be using mixed indentation or two spaces for indentation. Mixed indentation makes it difficult to follow blocks of code and two spaces is harder to read than four because of the lack of whitespace to differentiate between indentation levels and because it can encourage over-indentation since it takes more for you to run up against that 80 column line. Please don’t do what GNU does.
Given the advantages tabs have over spaces and that the only benefit spaces have over tabs is that they look the same to every programmer (which is not actually good or important given my points about programmer preference and line length), it is easy to see why tabs are a natural choice for indentation.
Language Specifics
Different programming languages have different conventions or best-practices that are dependent on the syntax and usage of that language. Here are some notes on various languages:
I am still expanding this section
C
C functions should be named in snake_case, both because of the convention that
internal functions are preceded by two underscores and because “namespacing” your
functions by prepending a category or type is more readable with underscores.
There is no performance difference between ++i and i++ in most compilers when
the value of the expression is unused (unless you tell your compiler to not
attempt any optimizations whatsoever).
Prefer enums over #define statements because they are easier to debug.
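A small sketch of the enum approach, using a hypothetical Color type. Enumerators are normally recorded in debug information, so a debugger can usually show COLOR_GREEN instead of a bare 1, and the compiler can warn about a switch that misses a case; macro constants are replaced by the preprocessor before the compiler sees them.

```c
/* Named constants as an enum rather than a series of #define lines. */
enum Color
{
    COLOR_RED,
    COLOR_GREEN,
    COLOR_BLUE
};

/* Return a printable name for color, or "unknown" for an invalid value. */
const char *
color_name(enum Color color)
{
    switch (color)
    {
    case COLOR_RED:
        return "red";
    case COLOR_GREEN:
        return "green";
    case COLOR_BLUE:
        return "blue";
    }
    return "unknown";
}
```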
Avoid macros, and if you do have to use them for performance reasons, they must be as simple as possible so they are not a pain to debug.
Sometimes a “magic number” is nicer to have in the code directly, with an
accompanying comment, than in a completely different section of the code as
a #define. Choose to do what is more readable and easily understood.
Only typedef structs when they are supposed to be opaque to the user (i.e.
they should only interact with the struct through functions, and never access
fields directly). Also, separate typedefs from your structs/enums because it’s
easier to read and grep for.
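A sketch of what an opaque, typedef'd struct can look like, with the typedef kept separate from the struct definition. The Counter type and its functions are hypothetical; in a real project the first part would live in a header and the second in a .c file, and they are shown together here only to keep the example self-contained.

```c
#include <stdlib.h>

/* --- counter.h (hypothetical): callers only ever see the opaque type --- */
struct Counter;
typedef struct Counter Counter;

Counter *counter_new(void);
void counter_increment(Counter *c);
int counter_value(const Counter *c);
void counter_free(Counter *c);

/* --- counter.c (hypothetical): the fields are visible only here --- */
struct Counter
{
    int value;
};

/* Allocate a counter starting at zero; returns NULL on failure. */
Counter *
counter_new(void)
{
    Counter *c;

    c = malloc(sizeof(*c));
    if (c == NULL)
    {
        return NULL;
    }
    c->value = 0;
    return c;
}

void
counter_increment(Counter *c)
{
    c->value++;
}

int
counter_value(const Counter *c)
{
    return c->value;
}

void
counter_free(Counter *c)
{
    free(c);
}
```

Because code outside the implementation file never sees the fields, it can only go through the functions, which is what justifies hiding the struct keyword behind a typedef in this case.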
The return type for functions should be on a separate line so it’s easy to
search a codebase for the function implementation. Also, if a function takes no
arguments, be explicit about it with the void keyword. Do this:
int
main(void)
{
    return 0;
}
not this:
int main()
{
    return 0;
}
Declare variables on their own line and definitely do not mix declarations and assignments on the same line. Do:
int x;
int y;
not:
int x = 5, y;
Also, initialize variables near where they will be used, rather than at their declaration.
Avoid manual inlining or other statements that only serve to “hint” at the compiler to do something. Let the compiler handle it.
Never use unsafe versions of functions (e.g. strcpy, sprintf)
even when you “know” your data will fit. You never know when someone
else (or a future version of you) will come along and make a change to
the code that causes the data to no longer fit, and now you have a bug
at best, or a vulnerability at worst.
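As one possible bounded alternative, the sketch below copies a string with the standard snprintf function, which never writes more than the given size and always terminates the result when the size is non-zero. The wrapper name copy_string and its return convention are assumptions for this example.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Copy src into dst (a buffer of dst_size bytes), always terminating it.
 * Returns 0 on success, or -1 if src did not fit and was truncated.
 */
int
copy_string(char *dst, size_t dst_size, const char *src)
{
    int written;

    written = snprintf(dst, dst_size, "%s", src);
    if (written < 0 || (size_t)written >= dst_size)
    {
        return -1;
    }
    return 0;
}
```

If someone later makes the source string longer, the caller gets a -1 to handle instead of a silent buffer overflow.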
Don’t use alloca or allocate large arrays/structs on the stack. Prefer
malloc/free.
Return values should generally be -1 or NULL for errors, 0 for
success, and >0 for any other non-error state/value your
function/program wishes to communicate. Make use of standard errno,
perror and other such tools to set or find out exactly what went
wrong.
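A minimal sketch of this convention: the hypothetical function parse_count returns 0 on success and -1 on error, and uses the standard errno values EINVAL and ERANGE to say what went wrong. The function name and its out-parameter are assumptions made for the example.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a non-negative decimal integer from s into *out.
 * Returns 0 on success, or -1 with errno set on error.
 */
int
parse_count(const char *s, long *out)
{
    char *end;
    long value;

    if (s == NULL || *s == '\0')
    {
        errno = EINVAL;
        return -1;
    }
    errno = 0;
    value = strtol(s, &end, 10);
    if (errno != 0)
    {
        /* strtol has already set errno (e.g. ERANGE) */
        return -1;
    }
    if (*end != '\0' || value < 0)
    {
        errno = EINVAL;
        return -1;
    }
    *out = value;
    return 0;
}
```

A caller can then write if (parse_count(arg, &n) == -1) { perror("parse_count"); } and get a meaningful message without the function needing its own error-reporting scheme.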
Prefer the /* ... */ style of comments. These are trivially easy to
extend into multi-line comments and I personally think they look nicer
than C++ style // ... comments. Also, when writing a multi-line
comment, write them like so:
/* This is a multiline comment.
 * A comment with multiple lines.
 * Many lines, such wow. */
or:
/*
 * This is a multiline comment.
 * A comment with multiple lines.
 * Many lines, such wow.
 */
Just don’t mix the two.
For switch/case statements, use “fallthrough” comments unless several case
statements follow the same branch. For example:
switch (...)
{
case 1:
case 2:
case 3:
    /* ... */
    /* fallthrough */
case 4:
    /* ... */
    break;
default:
    /* ... */
    break;
}
Don’t #include .c files and don’t use #include in header files.
Make liberal use of valgrind and other dynamic analysis, profiling, and static
checking tools to catch obvious mistakes.
Also see:
- C Programming in Plan 9 from Bell Labs
- Notes on Programming in C by Rob Pike
- BadDiode C Style Guide
- Sigrid’s C Style
Shell Scripting
Prefer do, then, etc. on the same line as the control statement. Since there
are no braces to demarcate code blocks in shell scripts the same way there are
in many programming languages, the pair of if/fi, for/done, and so on
make up the visual marker of a block of code. For example:
for i in $(seq 10); do
    echo "$i"
done
if [ "$var" = "value" ]; then
    echo "var is value"
fi
Use quotes around variables wherever possible. This prevents accidental incorrect behaviour where a variable expands to a value that contains spaces or special characters that then get interpreted by the shell as being additional arguments to a program or other shell syntax.
In general, write POSIX-compliant shell programs that use portable program flags
over ones that are OS or shell-specific. If non-portable flags or programs must
be used on one system but not another, check the operating system name using
uname -s and set variables like $cmd and $cmd_flags which can be used later on
in the script to call the right program with the right flags.