Notes for COSC345 Lecture 17, 2016

Names

Grammar of names

Names of commands should be imperative phrases, like "delete-file".
Names of Boolean variables or functions should be adjectives or adjective phrases. Some people prefer putting a form of the verb to be with them, like "is.red", or "was.previously.reported". Some prefer not to, "available", "frequently_requested".
Names of other data or functions should be nouns or noun phrases, like "unsolved_cell_count".
People argue over whether variables with collection values should be given plural names like "taken_pieces" or singular names like "taken_piece". If you are going to refer to the whole collection,
```
taken_pieces count: [:each | each is_pawn]
```
reads quite naturally. If you are going to refer to single elements,
```
int pawn_count = 0;
for (int i = 0; i < taken_piece_count; i++)
    if (taken_piece[i].is_pawn())
        pawn_count++;
```
reads better. The same person writing the same algorithm in C and Smalltalk might reasonably use a singular name in C and a plural name in Smalltalk.
In languages where variables precede types in declarations, types should be named so that v: t can be pronounced as "v is a t." (This is another reason to give collections singular names.) In languages that put types first, like Java, there is no sensible way to pronounce declarations anyway.
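
A small C sketch of these rules, reusing the chess names from above. The piece type, its kind field, and promote_pawn are made up for illustration; nothing here comes from a real chess program.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical piece representation: 'P' = pawn, 'Q' = queen, and so on. */
typedef struct piece {
    char kind;
} piece;

/* Boolean function: an adjective phrase (with a form of "to be"). */
bool is_pawn(piece p)
{
    return p.kind == 'P';
}

/* Other function: a noun phrase naming the result.
   The array has a singular name because elements are used one at a time. */
int pawn_count(const piece taken_piece[], int taken_piece_count)
{
    int count = 0;
    for (int i = 0; i < taken_piece_count; i++)
        if (is_pawn(taken_piece[i]))
            count++;
    return count;
}

/* Command: an imperative phrase. */
void promote_pawn(piece *p)
{
    assert(is_pawn(*p));   /* only pawns can be promoted */
    p->kind = 'Q';
}
```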

Separate your words

Alongtimeagowhenpeoplestartedtowritetheyrantheirwordstogether. Suchtextisnoteasytoread. ItIsNotMuchBetterWhenYouUseInternalCaps, TheBaStudlyCapsStyle, BecauseWhatYouGetIsStillLongBlackBlobs. DoNotBeSoRudeToYourReadersIfYouCanPossiblyAvoidIt.

People have to be able to decode identifiers. Distinguish a workplace that is unionised from a chemical that is un_ionised. Distinguish a man who works for the UN, a UN_man, from a mutant who isn't quite human, an unman. You may want to distinguish things that have capital letters in ordinary text (like proper names and acronyms) from things that do not, so it is unwise to use capitals for separation. For example, George Boring presumably isn't boring. According to a family name site, Superman was drawn in the 1940s and 1950s by a Boring artist (but not a boring one).

Compare

Romberg integral of(f) from:(0) to:(1) epsilon:(0.01) — Algol 60
RBGINT(F, 0, 1, 0.01) — Fortran 66
(Romberg-integral :fun f :from 0 :to 1 :epsilon 0.01) — Lisp
Romberg.integral(FUN=f, from=0, to=1, epsilon=0.01) — R
Romberg_Integral(Fun => F, Lower_Bound => 0, Upper_Bound => 1, Epsilon => 0.01) — Ada
RombergIntegral of: f from: 0 to: 1 epsilon: 0.01 — Smalltalk-80
Romberg_Integral of: f from: 0 to: 1 epsilon: 0.01 — modern Smalltalk
rombergIntegral(f, 0, 1, 0.01) — Java

If you are not already familiar with Romberg integration, in which of these languages is it obvious that Romberg is a proper noun? In which might Romberg be confused with a colour space?

Make the breakdown of your names into separate words obviously unambiguous.
Do not confuse Boring people with boring people.
Use keyword arguments if you have them.

Completely self-documenting code is impossible.

Look at the examples above. Three of the arguments are numbers. There are six possible ways they might be ordered. The Java version cannot be self-documenting.
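
C, like Java, has no keyword arguments, but C99 designated initialisers give a rough approximation: pass one struct and name its fields at the call site. The sketch below is one possible shape, not a recommended library interface. The names romberg_args, romberg_integral, and square are invented here, and the fixed limit of 20 refinement rows is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Arguments are bundled so that callers must name each one. */
typedef struct romberg_args {
    double (*fun)(double);   /* integrand */
    double from;             /* lower bound */
    double to;               /* upper bound */
    double epsilon;          /* stop when successive estimates agree this well */
} romberg_args;

#define ROMBERG_MAX_ROWS 20  /* arbitrary cap on refinement steps */

static double magnitude(double x) { return x < 0.0 ? -x : x; }

/* Romberg integration: trapezoid rule plus Richardson extrapolation. */
double romberg_integral(romberg_args a)
{
    double prev[ROMBERG_MAX_ROWS], curr[ROMBERG_MAX_ROWS];
    double h = a.to - a.from;

    assert(a.fun != NULL && a.epsilon > 0.0);
    prev[0] = 0.5 * h * (a.fun(a.from) + a.fun(a.to));
    for (int i = 1; i < ROMBERG_MAX_ROWS; i++) {
        long new_point_count = 1L << (i - 1);
        double sum = 0.0, factor = 4.0;

        h *= 0.5;
        for (long k = 1; k <= new_point_count; k++)
            sum += a.fun(a.from + (double)(2 * k - 1) * h);
        curr[0] = 0.5 * prev[0] + h * sum;          /* refined trapezoid */
        for (int j = 1; j <= i; j++) {              /* extrapolate */
            curr[j] = curr[j-1] + (curr[j-1] - prev[j-1]) / (factor - 1.0);
            factor *= 4.0;
        }
        if (magnitude(curr[i] - prev[i-1]) < a.epsilon)
            return curr[i];
        for (int j = 0; j <= i; j++)
            prev[j] = curr[j];
    }
    return prev[ROMBERG_MAX_ROWS - 1];              /* best estimate so far */
}

/* Sample integrand for the call shown below. */
double square(double x) { return x * x; }

/* Call site:
   romberg_integral((romberg_args){ .fun = square, .from = 0.0,
                                    .to = 1.0, .epsilon = 0.01 })   */
```

The reader of the call no longer has to guess which of the six orderings applies, although the names still say nothing about what Romberg integration is.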

If you are not familiar with Romberg integration, what does the name Romberg_integral tell you, all by itself?

The staff member who wrote the original version of these slides exhorting you to "Aim for ... completely self-documenting code" has an average of one comment line for every 12 SLOC in his magnum opus. And in my view, because I have struggled to understand it, it does not have enough comments! And he was aiming for self-documenting code.

Nobody struggling to understand an unfamiliar body of code ever said “I wish this had fewer helpful comments.”

Distinguishing constants from variables.

Some people are adamant that you should WRITE CONSTANTS IN UPPER CASE so that you can tell them from constants. (Sorry, tell them from variables. It doesn't really make sense either way.) In antique C, where constants were normally declared as macros, the const keyword not having been adopted, that made sense, because mutable variables and #defined constants followed different scope rules.

Other people are adamant that variables and constants should be named exactly the same way. After all, if a variable isn't changed in some region of code, why do you even care? (Java 8 has the notion of “effectively final” variables.) We have better things to spend notational capital on.

I suggest a meta-guideline. If a programming language is such that you routinely need to know whether a name is a constant or a variable, then by all means use capitalisation to distinguish them. If, however, they are mostly interchangeable, then don't.

Don't use case style to distinguish constants from variables unless the reader of your code needs to know.
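
C is arguably a language where the reader does need to know. A rough sketch, assuming a pre-C23 compiler: a #defined name and an enumeration constant are constant expressions, but a const-qualified variable is only a read-only variable, so it cannot give the size of a file-scope array. All the names below are illustrative.

```c
#include <assert.h>

#define BOARD_SIDE 8                /* macro: replaced textually, ignores scope */
enum { board_side = 8 };            /* enumeration constant: obeys scope */
static const int side = 8;          /* read-only variable, not a constant expression */

static int board[BOARD_SIDE][board_side];   /* both are accepted as array sizes */
/* static int rejected[side];          not valid at file scope in standard C */

/* At run time all three names behave the same way. */
int count_empty_cells(void)
{
    int count = 0;

    assert(side == BOARD_SIDE && side == board_side);
    for (int row = 0; row < side; row++)
        for (int col = 0; col < board_side; col++)
            if (board[row][col] == 0)
                count++;
    return count;
}
```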

On the importance of comments

Consider the problem of specifying a point near the Earth's surface. We clearly need at least 3 numbers, and in order to recognise the same point if we see it again, it's clear that we'd like these co-ordinates to be referred to a reference frame that rotates rigidly with the Earth.

Centuries of tradition tell us that the answer is latitude (North/South), longitude (East/West), and height above mean sea level.

So now we have

typedef struct geo {
    double lat, lon, hgt;
} geo;

What do we have?

What are the units of latitude? Radians? Grads? Full circle = 1? Degrees? Seconds? Millidegrees? Tradition tells us that it's degrees.
What is the origin of latitude? Tradition sets it as 0° = the equator. But it could have been the North Pole.
What is the range of latitude? Tradition tells us that -90° = as far south as possible and +90° = as far north as possible.
What are the units of longitude? It would be silly if latitude and longitude were different, so say degrees.
What is the origin of longitude? Whatever it is, it's called the Prime Meridian. Tradition in our culture says it's the longitude line running through the Greenwich Observatory.
What is the range of longitude? Tradition tells us that -180° = as far west as you can go, +180° = as far east as you can go.
What are the units of height? Let's be SI and use metres.
What is the origin of height? Tradition says "mean sea level". How exactly you measure that in, say, the middle of Australia, has never been clear to me... It would make sense to measure from the centre of the Earth, except that mean sea level has always been rather more accessible.
What is the range of height? Objects in orbit below about 200 km typically don't stay in orbit very long. (The ISS is 400 km up and needs periodic shoves to stop it falling down.) The deepest mine is about 3.9 km deep. The deepest ocean trench is about 10.9 km deep. The Mohorovičić discontinuity is as much as 90 km below the level of the continents, and people have been interested in drilling down to it. So if we take a range of -100 km to +200 km, that might be reasonable.
What direction is “up” anyway? The direction from you straight towards the centre of the earth and the direction that gravity is pulling you are not the same direction. They're close, but gravimeters can tell the difference, and if |height| is big enough, it will matter.
What precision do we need for these? If 360° = one Earth circumference, then we need 1 part in 10⁹ precision to represent a point on the surface to better than 1 m accuracy, and about 1 part in 10⁶ to get height that good.
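
The arithmetic can be checked with a few lines of C. This is only a sketch: it assumes a circumference of about 40 075 km and binary floating point as found on ordinary hardware, and surface_metres and spacing_near are names invented here.

```c
#include <assert.h>
#include <float.h>

/* Assumption: the Earth's circumference is about 40 075 km. */
#define EARTH_CIRCUMFERENCE_M 40075000.0

/* Distance along the surface covered by an angle in degrees. */
double surface_metres(double degrees)
{
    return degrees / 360.0 * EARTH_CIRCUMFERENCE_M;
}

/* Gap between adjacent representable values near `magnitude` (>= 1)
   for a binary floating-point type with the given machine epsilon. */
double spacing_near(double magnitude, double epsilon)
{
    double power_of_two = 1.0;

    assert(magnitude >= 1.0);
    while (power_of_two * 2.0 <= magnitude)   /* largest power of 2 <= magnitude */
        power_of_two *= 2.0;
    return power_of_two * epsilon;
}
```

Under those assumptions, surface_metres(spacing_near(180.0, FLT_EPSILON)) comes to about 1.7 m, while the same call with DBL_EPSILON gives a few nanometres. A single precision longitude near ±180° therefore cannot resolve 1 m.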

So now we have reached

typedef struct geo {
    double lat; //  -90 (S) to  +90 (N) degrees, 0=equator
    double lon; // -180 (W) to +180 (E) degrees, 0=Greenwich
    double hgt; // -100 to +200 km, 0 = mean sea level
} geo;

Are we done yet? And did we really need those comments? Couldn't we have used

typedef struct geographical_location_3d {
    double latitude_in_degrees_north_of_equator;
    double longitude_in_degrees_east_of_Greenwich;
    float  height_in_metres_above_mean_sea_level;
} geographical_location;

and had self-documenting code?

Just how long do you think you would be willing to type those names?

At this point, someone is bound to say “but my IDE offers completion based on the first few letters, so I don't have to type much”. That works fine until you need

latitude_in_degrees_north_of_equator
latitude_in_seconds_north_of_equator
longitude_in_degrees_east_of_Greenwich
longitude_in_degrees_east_of_Paris

in the same program.

Above all, those names do not tell us everything we desperately need to know!

There are two systems of coördinates for Mars: planetographic or areographic coördinates are referred to the mean surface of the planet and are used for observations of surface features, while planetocentric or areocentric coördinates are referred to the equatorial plane and are used for celestial mechanics. Thus astronomers would use different coördinates to refer to the present location of Deimos and the point directly below it on Mars. The way latitude is measured differs between the two systems due to the oblateness of rotating planets, just like it does on Earth. Planetographic longitude runs from 0° to 360° increasing to the west; planetocentric longitude runs from 0° to 360° increasing to the east. (Most) maps of Mars made before 2002 used planetographic latitude with west longitude. Newer (and some older) maps use planetocentric latitude with east longitude. (Source: IAU Gazetteer of Planetary Nomenclature.) The potential for confusion is dizzying. For us, the point is that “degrees east of Greenwich” doesn't tell us whether 1° east is -1 or 359.
We are missing all range and precision information.
And “height in metres” does not tell us whether it's local-gravity-down or towards-the-centre-of-the-Earth down.
There are in fact many geographic coördinate systems in use on Earth, using many "reference ellipsoids". "To date, there exist organizations around the world which continue to use historical prime meridians which existed before the acceptance of Greenwich became common-place." (pag. cit.)
From the latitude page, "Many different reference ellipsoids have been used in the history of geodesy. In pre-satellite days they were devised to give a good fit to the geoid over the limited area of a survey but, with the advent of GPS, it has become natural to use reference ellipsoids (such as WGS84) with centres at the centre of mass of the Earth and minor axis aligned to the rotation axis of the Earth. These geocentric ellipsoids are usually within 100 m of the geoid." That is, height relative to the WGS84 ellipsoid can be ±100 m different from height relative to mean sea level. If you tried to fly a drone to a friend's location as reported by a GPS unit, you might try to land it 100 m in the air, or part of its journey might be through soil. Oops!
ISO 6709, therefore, says that each set of coördinates should come with an explicit or contextually determined Coördinate Reference System (CRS) identifier, such as WGS84. The draft standard shows that latitude and longitude can refer to points over 100 m apart if you don't do that. In fact, the WGS84 prime meridian goes about 100 m away from Greenwich.
If you want to recognise the same position in a few years, you don't want higher precision than I suggested above, because continental drift means places are moving about 1-10 cm a year.
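
One way to act on ISO 6709's advice in C is to carry the CRS identifier in the record itself and refuse to compare points that come from different systems. A minimal sketch: crs_id, located_point, and compare_points are hypothetical names, and CRS_LOCAL_SURVEY merely stands in for any second system. Height and wrap-around at ±180° are ignored to keep it short.

```c
#include <assert.h>

/* Hypothetical CRS tags; a real program would list the systems it meets. */
typedef enum crs_id { CRS_WGS84, CRS_LOCAL_SURVEY } crs_id;

typedef struct located_point {
    crs_id crs;        /* which reference system the numbers belong to */
    double latitude;   /* degrees */
    double longitude;  /* degrees */
    double height;     /* metres */
} located_point;

typedef enum comparison {
    POINTS_MATCH,
    POINTS_DIFFER,
    POINTS_INCOMPARABLE   /* different CRSs: the numbers mean different things */
} comparison;

static double magnitude(double x) { return x < 0.0 ? -x : x; }

/* Compare horizontal position only, and only within one CRS. */
comparison compare_points(located_point a, located_point b,
                          double tolerance_in_degrees)
{
    assert(tolerance_in_degrees >= 0.0);
    if (a.crs != b.crs)
        return POINTS_INCOMPARABLE;
    if (magnitude(a.latitude - b.latitude) <= tolerance_in_degrees
     && magnitude(a.longitude - b.longitude) <= tolerance_in_degrees)
        return POINTS_MATCH;
    return POINTS_DIFFER;
}
```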

Some programming languages let you say more than others. For example, we can express range and precision information precisely in Ada, where the compiler can see them, check them, and take advantage of them. Here's what it looks like.

-- ISO 6709:2008 geographical point representation.
-- The Coordinate Reference System (CRS) WGS_84
-- (World Geodetic System 1984, as revised in 2004) is always used.
-- Latitude and longitude are measured in degrees.
-- +ve latitude is north; +ve longitude is east (0 = prime meridian).
-- Height is measured in metres.
-- The deltas are chosen for about 10 cm resolution.

type Latitude_Range
  is delta 0.000_001 digits 8 range    -90.0 ..    90.0;
type Longitude_Range
  is delta 0.000_001 digits 9 range   -180.0 ..   180.0;
type Height_Range
  is delta 0.1       digits 7 range -100_000.0 .. 200_000.0;
type Geographic_Location
  is record
     Latitude  : Latitude_Range;
     Longitude : Longitude_Range;
     Height    : Height_Range;
  end record;

The Novopay system is implemented in Oracle's PL/SQL, which lets you write this:

-- ISO 6709:2008 geographical point representation.
-- The Coordinate Reference System (CRS) WGS_84
-- (World Geodetic System 1984, as revised in 2004) is always used.
-- Latitude and longitude are measured in degrees.
-- +ve latitude is north; +ve longitude is east (0 = prime meridian).
-- Height is measured in metres.
-- The deltas are chosen for about 10 cm resolution.

DECLARE
  SUBTYPE Latitude_Range      IS NUMERIC(8,6);
  SUBTYPE Longitude_Range     IS NUMERIC(9,6);
  SUBTYPE Height_Range        IS NUMERIC(7,1);
  TYPE    Geographic_Location IS RECORD (
             Latitude  Latitude_Range,
             Longitude Longitude_Range,
             Height    Height_Range);

where we can state the precision but not the true range.

We can't even do that in C. The best we can do is

/*  ISO 6709:2008 geographical point representation.
    The Coordinate Reference System (CRS) WGS_84
    (World Geodetic System 1984, as revised in 2004) is always used.
    Latitude and longitude are measured in degrees.
    +ve latitude is north; +ve longitude is east (0 = prime meridian).
    Height is measured in metres.
    We want to have about 10 cm resolution, so single precision
    floats are NOT adequate for latitude & longitude.
*/
typedef double latitude_range;  // -90 to +90
typedef double longitude_range; // -180 to +180
typedef double height_range;    // -100 km to +200 km in m.
typedef struct geographic_location {
    latitude_range   latitude;
    longitude_range  longitude;
    height_range     height;
} geographic_location;
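
Since the C compiler will not check the ranges given in those comments, a run-time check is about the best that can be added. A sketch, repeating the declarations so that it stands alone; geographic_location_is_valid is a name invented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef double latitude_range;  /* -90 to +90 degrees */
typedef double longitude_range; /* -180 to +180 degrees */
typedef double height_range;    /* -100 km to +200 km, in metres */

typedef struct geographic_location {
    latitude_range   latitude;
    longitude_range  longitude;
    height_range     height;
} geographic_location;

/* Check at run time what Ada would check from the type declarations.
   A NaN in any field fails every comparison, so it is rejected too. */
bool geographic_location_is_valid(const geographic_location *p)
{
    assert(p != NULL);
    return p->latitude  >=     -90.0 && p->latitude  <=     90.0
        && p->longitude >=    -180.0 && p->longitude <=    180.0
        && p->height    >= -100000.0 && p->height    <= 200000.0;
}
```

Nothing forces callers to use the check, which is the difference from the Ada version.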

Rules about comments.

If there is important information about something in your program which cannot be expressed in your programming language, it must be expressed in the name or in a comment.

If the information should be expressed every time the thing is mentioned and can be expressed in the name, put the information in the name. If the information is too bulky to go in the name, it must go in a comment.

Interfaces must be commented adequately, and it is the client of the interface who determines what counts as adequate, not the author of the implementation.

Comments do not have to be in-line

There are annotation editors that let you keep a file of annotations in parallel with a source file. This is especially useful if you want to make notes about a source file that you are not able for some practical or legal reason to edit.

The Eclipse IDE supports such annotations.

Review Board allows annotations as part of its support for code review. You should visit the Review Board web site and take a look.

There is a package 'annot.el' for Emacs which offers incredibly simple annotation. There are several ports of Emacs for Mac OS X, but Aquamacs would be top of the list. Annotations are stored in files in your ~/.annot directory. As long as you don't change the file, or only change it through Emacs with annot.el loaded, there are no problems. Change it with another editor, and the MD5 hash changes and annot.el no longer recognises the file. Still some work to do there. But it is perfect for annotating things you don't change.

Annotations are also great for tools. Compiler error messages can be converted to annotations. Intel have a tool that can generate parallelism annotations. Seeing “comments” from a static checker in situ without changing the file is handy.

What Lind and Vairavan really found

One of the slides says “Highly commented parts of code have the highest error rate (Lind and Vairavan 1989)”. What did Lind and Vairavan actually find?

They studied one program. It was a big one, with thousands of procedures, some in (old) Fortran and some in Pascal.

There is a dogma that short functions are best, that anything over one page is bad. They actually found that short functions (1-50 lines) had nearly twice the error rate of medium ones (51-100 lines).

They found that comments are correlated with errors for two reasons.

The number of comments in a procedure was strongly correlated with its length.
The number of errors in a procedure was strongly correlated with its length.

They also found that the slide has causality backwards. The slide seems to say “if you put lots of comments in your code, it will then turn out to have lots of errors”. In fact they suggested the opposite: code that had lots of errors ended up with lots of comments.

That's why the next line on the slide says “Don't comment tricky code, re-write it!”. Of course, that assumes that it is possible to eliminate the trickiness by rewriting. If only that were true.

Indentation

First we get world peace. Then we convert everyone to the same religion. Then we eliminate poverty. And then we get people to agree on a layout style.

Some things are not a matter of taste.

The purpose of indentation is to reveal structure.
To reveal structure, indentation must be consistent.
Using the TAB key to indent is stupid because nobody agrees about where tab stops are. The UNIX standard is every 8 columns. Xcode has Cmd-] to indent and Cmd-[ to outdent. Vi has >> to indent by shiftwidth and << to outdent by shiftwidth. Emacs has more indentation support than you can shake a directory tree at, starting with Ctrl-X TAB.

See also Rob Pike's Notes on Programming in C.