Free software by Richard A. O'Keefe

Lenses for Erlang

lens.erl is a crude lens library for Erlang. It represents a lens as a triple {Get,Put,Upd}, where Get extracts the field, Put stores a new value into it, and Upd applies a function to it.

Think of a lens as a way of accessing a B field from an A value.

lens1.erl is an earlier draft that used a pair {Get,Put} — this works but is less efficient for updates. This whole thing really needs cross-module inlining to work well, but given that, it would solve the "how do I update nested fields" problem.
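The shape of the triple, and why composed lenses solve the nested-field problem, can be sketched in a few lines. The following Python rendering is only an illustration of the idea; all names here are mine, not lens.erl's API.

```python
# A lens as a (get, put, upd) triple, mirroring the {Get,Put,Upd} tuples
# of lens.erl.  This sketch focuses on tuple fields for concreteness.

def pair_first():
    """Lens focusing on the first element of a 2-tuple."""
    get = lambda a: a[0]
    put = lambda a, b: (b, a[1])
    upd = lambda f, a: (f(a[0]), a[1])
    return (get, put, upd)

def compose(outer, inner):
    """Compose two lenses: focus on inner's field within outer's field."""
    og, op, ou = outer
    ig, ip, iu = inner
    get = lambda a: ig(og(a))
    put = lambda a, b: op(a, ip(og(a), b))
    # Carrying Upd separately lets an update traverse the structure once;
    # with only {Get,Put} (as in lens1.erl) an update needs a Get pass
    # followed by a Put pass, which is why that draft is slower.
    upd = lambda f, a: ou(lambda x: iu(f, x), a)
    return (get, put, upd)
```

For example, composing the first-element lens with itself gives a lens onto the first element of the first element: `get(((1, 2), 3))` yields 1, and `upd` of an increment yields `((2, 2), 3)` without any hand-written nesting code.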

A faster random number generator for Java

Random48.java uses the same 48-bit linear congruential generator as java.util.Random, but it is designed not to be shared between threads (sharing doesn't really make sense for a random number generator anyway), which makes it 5 times faster. It also provides a useful .toString() and a constructor that takes a String argument; the text representation looks like a Java identifier, so you can easily save and restore Random48 states using plain text files. RandomTime.java is the program I used to measure the generator's speed.

Linear congruential generators have known limitations, so if you seriously want the best available generator, this isn't it. This is for you if you were happy with the results from java.util.Random but wished it were not so sluggish.
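The recurrence itself is simple enough to sketch. The Python below shows the 48-bit LCG that java.util.Random specifies (state' = (state * 0x5DEECE66D + 0xB) mod 2^48), plus a plain-text save/restore in the spirit of the toString()/String-constructor feature described above. The textual format and all names here are my own, not Random48's.

```python
# The 48-bit LCG used by java.util.Random (and hence by Random48).
MULT = 0x5DEECE66D
INC = 0xB
MASK = (1 << 48) - 1

class LCG48:
    def __init__(self, seed=0):
        # java.util.Random scrambles the initial seed like this.
        self.state = (seed ^ MULT) & MASK

    def next_bits(self, bits):
        """Advance the state and return its top `bits` bits."""
        self.state = (self.state * MULT + INC) & MASK
        return self.state >> (48 - bits)

    def save(self):
        # A plain-text snapshot of the 48-bit state (format is mine,
        # not Random48's), restorable later from a text file.
        return "S%012x" % self.state

    @classmethod
    def restore(cls, text):
        g = cls()
        g.state = int(text[1:], 16)
        return g
```

Saving the state and restoring it later reproduces exactly the same sequence, which is the point of the text representation: generator states can round-trip through ordinary text files.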

Lists with O(1) concatenation and reversal

The Reverse-Append-Unit-eMpty structure is list-like, with constant-time append and reverse and linear-time batch operations. Head and tail are not constant time; they are not even logarithmic in the worst case. This is not a very sophisticated data structure, but that's the point: to show that quite an unsophisticated data structure can do surprisingly well. This version is in SML and works with SML/NJ 110.70 and MLton (June 15, 2009 release).
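The trick is to represent a list as a tree of Append, Reverse, Unit, and eMpty nodes, so that append and reverse each allocate one node in O(1), and a batch pass flattens the whole thing in linear time. The original is SML; this Python sketch (with constructor names of my choosing) just shows the idea.

```python
# Reverse-Append-Unit-eMpty, sketched with tagged tuples.
EMPTY = ("E",)

def unit(x):      return ("U", x)
def append(a, b): return ("A", a, b)   # O(1): just build a node
def reverse(a):   return ("R", a)      # O(1): just build a node

def to_list(t):
    """Linear-time batch conversion, with an explicit work stack.
    Each stack item is (node, reversed?); a Reverse node merely flips
    the flag, and a reversed Append emits its children right-to-left."""
    out = []
    stack = [(t, False)]
    while stack:
        node, rev = stack.pop()
        tag = node[0]
        if tag == "E":
            pass
        elif tag == "U":
            out.append(node[1])
        elif tag == "R":
            stack.append((node[1], not rev))
        else:  # "A"
            l, r = node[1], node[2]
            if rev:   # emit r (reversed) then l (reversed)
                stack.append((l, True)); stack.append((r, True))
            else:     # emit l then r
                stack.append((r, False)); stack.append((l, False))
    return out
```

Note that head and tail are absent from the sketch, matching the caveat above: finding the first element of a deeply nested tree of Reverse and Append nodes can take time proportional to the depth.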

ANSI Smalltalk Compiler and library

For some years I have been building a static compiler for ANSI Smalltalk. This compiles a variant of ANSI Smalltalk -- all the ANSI classes and methods in ansi.st, plus a lot more in other files -- to ANSI C89, which can then be compiled by gcc or Sun's C compiler. It is routinely tested on SPARC/Solaris and Intel/OSX; it is also periodically built and tested on Intel/OpenSolaris, Intel/Linux, Intel/OpenBSD (versions from 4.7 to 5.6), and Intel/Windows 7 + SUA and Cygwin. I have acquired tiny ARM and MIPS boards with the intention of testing on those, but have so far been too busy.

Some things are still missing or incomplete. The main thing that's woefully incomplete is the documentation, although what does exist comes to about 274 pages.

Source code in Smalltalk comes to:

  Raw lines   SLOC   Area
  183k        106k   library
   24k         18k   test files
   26k         17k   examples
   30k         20k   RosettaCode solutions

Perhaps the main departure from common Smalltalk practice is that I will not use (though I do provide) #shouldNotImplement. Of course, having a batch compiler instead of an IDE, and allowing embedded C instead of calling primitives, aren't traditional either.

While it's not finished, it is already useful, and all of the code is completely free to anyone who wants it, for any purpose, as long as they don't claim credit for it. In particular, the Smalltalk code in the libraries may be used freely by the maintainers of any Smalltalk system, including commercial ones.

The code is in astc-1711.tar.

One thing I was particularly interested in was just how well a fairly naive compile-via-C strategy would go, especially considering that dynamic dispatch is done by binary search -- an idea swiped from SmallEiffel/SmartEiffel. The answer is that it goes very well: about as well as VisualWorks non-commercial, sometimes better, sometimes worse. One test, processing a 180 MB XML file, took 12 times as long as the corresponding C code, but considering all the work it takes to stuff a character into a string, that will be hard to improve much without flow-sensitive type inference, which is not currently attempted. The project was, after all, started to provide a naive reference point. Perhaps the weakest spot performance-wise is the boxing and unboxing of floating-point numbers; I have ideas about that, but have not yet begun to work on them.
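Dispatch-by-binary-search is worth a small sketch: per selector, the compiler can emit a table of (class-id, method) pairs sorted by class id, and the generated code binary-searches on the receiver's class id. The Python below only illustrates the technique; the names and the failure behaviour are my assumptions, not details of this compiler.

```python
import bisect

def make_dispatcher(impls):
    """Build a dispatcher for one selector.
    impls: dict mapping class-id -> function implementing the selector."""
    ids = sorted(impls)               # sorted class ids for binary search
    funcs = [impls[i] for i in ids]   # methods in the same order
    def dispatch(class_id, *args):
        k = bisect.bisect_left(ids, class_id)
        if k == len(ids) or ids[k] != class_id:
            # In Smalltalk this would be a doesNotUnderstand: message.
            raise LookupError("doesNotUnderstand:")
        return funcs[k](*args)
    return dispatch
```

The appeal of the scheme, as in SmallEiffel/SmartEiffel, is that it needs no per-object method tables and no caches: just a sorted, statically known table per call site and O(log n) comparisons over the handful of classes that actually implement the selector.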

The major practical problem is that the C file it produces runs to hundreds of thousands of lines.

This Smalltalk system is also serving as an education in Unicode. Currently, the compiler only accepts Latin 1, but the run time system handles UTF-8 and a tolerably wide range of 8-bit encodings. By wrapping byte streams, you can also use 16- and 32-bit Unicode, and even SCSU. The next major Unicode task will be handling character classification.

Comparing files against model answers

pcfpcmp was written for programming contest judges, so that problems whose output contains floating-point numbers can be set. Details are in the file README.pcfpcmp; the source code and an example are in an uncompressed `tar' file, pcfpcmp.tar (33kB).
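The general idea of such a comparison can be sketched: compare the two outputs token by token, and treat tokens that parse as numbers as equal when they agree within a tolerance. The Python below is purely illustrative; pcfpcmp's actual rules and tolerances are in README.pcfpcmp, and the parameters here are my assumptions.

```python
def tokens_match(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """One token: numeric tokens compare with a tolerance,
    everything else compares exactly."""
    try:
        e, a = float(expected), float(actual)
    except ValueError:
        return expected == actual
    return abs(e - a) <= max(abs_tol, rel_tol * max(abs(e), abs(a)))

def outputs_match(expected_text, actual_text, **kw):
    """Whole outputs: same number of whitespace-separated tokens,
    each pair matching."""
    e, a = expected_text.split(), actual_text.split()
    return len(e) == len(a) and all(
        tokens_match(x, y, **kw) for x, y in zip(e, a))
```

This is why such a tool matters for judging: a contestant who prints 3.14159270 where the model answer says 3.14159265 is almost certainly right, but a byte-for-byte diff would reject the submission.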

Token styling and extracting for several programming languages

m2h is a program that can tokenize, after a fashion, most of the programming languages I use. It can be used for at least five purposes.

The sources are provided as a gzipped tar file, m2h.tgz. Documentation is in the README file and in core.c.

Beware: this is not polished code, to put it kindly. A couple of features are not implemented yet. It's fairly easy to add new languages, except for a minor issue (which has held up support for Lisp and Haskell) and a major one (which hasn't been a problem for me yet, but will be). The minor issue is that the tokenising framework doesn't handle nested comments. The major issue is that there is no support at all for wide characters or for encodings other than ISO Latin 1; as part of this, C/C++/Java Universal Character Names (\uxxxx and \Uxxxxxxxx) are not processed correctly. It's free software and worth what you paid for it, but I have found it very useful and you may too.

Beware: this is not a pretty-printer, just a token styler. It does not add or remove line breaks or indentation for any language. Although I've had some ideas about adjusting inter-token spacing, they have NOT been fully implemented. All this does is filter tokens and maybe add some markup.

UnRolled Strict Lists

Unrolled strict lists are spine-strict lists that have been unrolled, in this case by a factor of four, so that each "box" contains 4 elements rather than 1. This saves space, and should also save time. I've provided versions for Haskell (Ursl.hs) and Clean (Ursl.dcl and Ursl.icl). The Clean versions were written for Clean 1.3 and have not been tried with Clean 2.x. If anyone is interested, I'll be happy to whip up an SML version. An Erlang version also exists, but has not yet been tested.
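The representation is easy to show: instead of one element plus one tail pointer per cell, each box holds one to four elements plus a tail pointer, so a list of n elements needs about n/4 boxes. The real versions are Haskell and Clean; this Python sketch (tagged tuples, my own names) only demonstrates the shape.

```python
# Boxes are tuples (k, e1, ..., ek, tail) with 1 <= k <= 4.
NIL = None

def cons(x, t):
    """Prepend x, filling the front box up to 4 elements before
    starting a new one."""
    if t is NIL or t[0] == 4:
        return (1, x, t)
    if t[0] == 1:
        return (2, x, t[1], t[2])
    if t[0] == 2:
        return (3, x, t[1], t[2], t[3])
    return (4, x, t[1], t[2], t[3], t[4])

def to_list(t):
    out = []
    while t is not NIL:
        k = t[0]
        out.extend(t[1:1 + k])
        t = t[1 + k]
    return out

def box_count(t):
    """Number of boxes -- roughly a quarter of the element count."""
    n = 0
    while t is not NIL:
        n += 1
        t = t[1 + t[0]]
    return n
```

Consing 10 elements this way produces just 3 boxes instead of 10 cons cells, which is where the space saving comes from; fewer pointers to chase is also where the expected time saving comes from.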

Four XML tools:

The files are