Why are patents hard to read? — Nominalisation

My experience is that patents in general are hard to read and that software patents are harder, despite software being my area of expertise. But is that really so? What is the language of patents like, and is the language of software patents any different from other patents?

From the Wikipedia article on MAREC:

"The MAtrixware REsearch Collection (MAREC) is a standardised patent data corpus available for research purposes. MAREC could be defined as a corpus that seeks to represent patent documents of several languages in order to answer specific research questions.[1][2] It consists of 19 million patent documents in different languages, normalised to a highly specific XML schema."

Of those 19 million patents, 5,639,471 are United States patents. (The collection apparently holds no New Zealand or Australian patents.) A random sample of 100,000 US patents is available, which is a good size for preliminary experiments. Processing these files to the point of having word lists takes about half an hour of CPU time on a 2.5 GHz Intel Core 2 Duo; part-of-speech tagging takes about two calendar days.

The paper "A Corpus-Driven Approach to Genre Analysis: The Reinvestigation of Academic, Newspaper and Literary Text", by Yasunori Nishina of the University of Birmingham, published in Empirical Language Research, Vol 2, ISSN 1746-6830 (online), reports on, amongst other things, the use of nominalisation in four corpora: academic articles, newspaper articles, literature, and a general collection. To these we can now add patents. The leftmost column holds my numbers; the other columns are taken from Nishina's table 10.

The last two rows are taken from Nishina's table 2. The first of them is the proportion of words that are 1-4 letters long, and the last is the mean word length. Again, the numbers in the first column are mine and the others are his.

           Patents  Academic  Newspaper  Literature  General

-tion      67%      55%       43%        41%         50%
-sion      6%       8%        10%        10%         9%
-ness      1%       4%        7%         14%         6%
-ment      17%      15%       22%        20%         18%
-ity       9%       18%       18%        15%         17%
Total      36.5‰    31.8‰     23.0‰      9.6‰        22.5‰

short      49.8%    55.4%     55.2%      62.9%       57.5%
ave.len.   5.04     4.86      4.78       4.30        4.68

We see that "literature", meant to be read for enjoyment, uses nominalisation least (9.6‰), while patents (36.5‰) exceed even academic articles (31.8‰), which are known for their dryness. Nishina quotes Biber et al. as pointing out that -ness words "often describe personal qualities", so it is not surprising that Nishina's literature corpus uses them most often and the MAREC 100k patent corpus uses them least.
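For readers who want to try this on their own text, here is a minimal sketch of how figures of this kind could be computed. It is not the procedure used for the table and not Nishina's: it assumes that a "nominalisation" is simply any word ending in one of the five suffixes, ignores plurals, and will therefore also count words such as "station" or "city". The names `SUFFIXES`, `suffix_of` and `nominalisation_profile` are made up for this illustration.

```python
from collections import Counter

# The five suffixes reported in the table (assumption: plain suffix matching).
SUFFIXES = ("tion", "sion", "ness", "ment", "ity")


def suffix_of(word):
    """Return the matching suffix, or None if the word has none of them."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Require at least one letter before the suffix itself.
        if w.endswith(suffix) and len(w) > len(suffix):
            return suffix
    return None


def nominalisation_profile(words):
    """Return (percentage share per suffix, overall rate per thousand words)."""
    counts = Counter()
    total_words = 0
    for word in words:
        total_words += 1
        suffix = suffix_of(word)
        if suffix is not None:
            counts[suffix] += 1
    matched = sum(counts.values())
    shares = {
        s: (100.0 * counts[s] / matched if matched else 0.0) for s in SUFFIXES
    }
    per_mille = 1000.0 * matched / total_words if total_words else 0.0
    return shares, per_mille


sample = (
    "the invention relates to a method for the detection "
    "of the thickness of an element"
).split()
shares, per_mille = nominalisation_profile(sample)
print(shares, round(per_mille, 1))
```

The shares are percentages of the matched words only, so they sum to 100 like the suffix rows of the table, while the rate is taken over all words, like the Total row.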

There are 774,452,756 occurrences of 1,749,380 distinct tokens in the 100k patent collection. Three quarters of these occurrences (75.6%) come from just 850 distinct tokens. The nominalisation forms found amongst those 850 are:

-tion: Twenty-seven words: invention, portion(s), information, position(s), operation, section, direction, application(s), solution, reaction, composition, communication, function, addition, connection, configuration, condition(s), description, detection, location, combination, rotation, station, concentration, formation, production, selection.
-sion: Two words: transmission, expression.
-ness: One word: thickness.
-ment: Five words: embodiment(s), element(s), treatment, movement, arrangement.
-ity: Three words: plurality, activity, density.
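The "850 tokens cover three quarters of the text" kind of figure can be reproduced with a short loop over a frequency list. The following is only a sketch under the assumption that the token counts are already in a `collections.Counter`; the function name `types_needed_for_coverage` and the toy counts are invented for the example.

```python
from collections import Counter


def types_needed_for_coverage(counts, target):
    """Smallest number of most-frequent types whose occurrences together
    reach the fraction `target` (between 0 and 1) of all occurrences."""
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= target * total:
            return rank
    return len(counts)


# Toy frequency list with 100 occurrences in total.
toy = Counter({"the": 50, "of": 25, "a": 15, "said": 6, "plurality": 4})
print(types_needed_for_coverage(toy, 0.75))
```

Once the cut-off rank is known, the types above it can be filtered by suffix to get lists like the one above.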

There are 605,619,546 occurrences of 919,369 distinct words in the 100k patent collection. Here are the 20 words that occur most often:

Individual  Cumulative
percentage  percentage  Word

 8.64%       8.64%      the
 4.11%      12.75%      of
 3.56%      16.31%      a
 2.57%      18.88%      and
 2.54%      21.42%      to
 2.13%      23.55%      in
 1.90%      25.44%      is
 0.93%      26.37%      for
 0.82%      27.19%      or
 0.81%      27.99%      as
 0.79%      28.79%      be
 0.78%      29.56%      an
 0.72%      30.29%      with
 0.71%      31.00%      by
 0.59%      31.60%      are
 0.58%      32.17%      that
 0.55%      32.73%      from
 0.54%      33.26%      said
 0.50%      33.77%      at
 0.49%      34.25%      which
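A table with individual and cumulative percentages is easy to produce from word counts. This sketch assumes the counts live in a `collections.Counter`; `frequency_table` and the toy sentence are illustrative names and data, not part of the original processing.

```python
from collections import Counter


def frequency_table(counts, top_n=20):
    """Return (individual %, cumulative %, word) rows for the top_n words."""
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for word, n in counts.most_common(top_n):
        cumulative += n  # accumulate raw counts, not rounded percentages
        rows.append((100.0 * n / total, 100.0 * cumulative / total, word))
    return rows


toy = Counter("the cat sat on the mat and the dog sat".split())
rows = frequency_table(toy, top_n=3)
for individual, cumulative, word in rows:
    print(f"{individual:6.2f}%  {cumulative:6.2f}%  {word}")
```

Accumulating raw counts and rounding only when printing may be why a cumulative figure can differ by a hundredth from the sum of the rounded individual figures, as at "is" in the table (23.55 + 1.90 would give 25.45).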

Only the last of these is longer than 4 letters, and all but one ("said") are closed-class words. Try writing English without using "the", "of", "a", or "and"! The proportion of short words (1-4 letters) is consistent with the nominalisation results: literature is at one end (62.9%) and patents at the other (49.8%), even more extreme than academic articles (55.4%). The mean word length goes the same way, although obviously mean word length and proportion of short words are related. As a rule, function words are short and content words are long, and base forms of words are shorter than derived and inflected forms.
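The two length measures can be computed together in one pass over the words. A minimal sketch, assuming "short" means 1-4 letters as in the table and that every token counts as one word; `length_profile` is a hypothetical helper name.

```python
def length_profile(words, short_max=4):
    """Return (percentage of words with at most short_max letters,
    mean word length in letters)."""
    words = list(words)
    if not words:
        return 0.0, 0.0
    short = sum(1 for w in words if len(w) <= short_max)
    mean_length = sum(len(w) for w in words) / len(words)
    return 100.0 * short / len(words), mean_length


sample = "a plurality of said elements".split()
short_pct, mean_length = length_profile(sample)
print(f"short: {short_pct:.1f}%  mean length: {mean_length:.2f}")
```

The two numbers move in opposite directions almost by construction: every long derived form such as "plurality" lowers the short-word share and raises the mean at the same time.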
