# 11 Data processing

© Stephen Robertson, CC BY 4.0 https://doi.org/10.11647/OBP.0225.11

In the United States, a census of the population is undertaken every decade. The 1880 census took eight years to analyse fully, and the Census Office badly wanted to be able to complete the analysis of the 1890 census in a very much shorter time. The solution to this problem was developed by Herman Hollerith.

## Tabulation

Every analysis of the census data involved sorting returns into categories, counting the number of returns in each category, and recording the resulting numbers. Any of the answers to the census questions, singly or in combination, might be the basis for categorisation. This process of sorting and counting can be described as tabulation, the creation of (printed) tables.

Hollerith’s system involved punched cards, using ideas from Jacquard, whom we encountered in Chapter 7 (Hollerith probably did not know Babbage’s work). Each individual census record (representing a person) would be encoded onto a punched card. Each census question would be associated with a group of punch positions, and the individual answer encoded as holes punched in some of these positions. Thus, for example, the age question had positions for different age-bands. For the gender question there were two positions, M and F, and a hole was punched in one of them. You might think that this datum needs only one position, where the absence or presence of a hole would indicate male or female. But this would assume (a) that this question is always answered one way or the other, and (b, more seriously for the Hollerith system) that the counting mechanism is just as able to count absence as presence of holes. Alternatively, one could count only the holes and obtain the other figure by subtracting that count from the total in the final table. At any rate, Hollerith did use two positions for this and other yes-no questions, despite the limited number of hole positions at his disposal.
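The encoding idea can be sketched in a few lines of Python. This is only an illustration: the layout, the age bands and the function name `punch_card` are my own assumptions, not Hollerith's actual card design. It shows why two positions for a yes-no style question let an unanswered question be told apart from either answer.

```python
# Hypothetical card layout: every possible answer to a question
# gets its own punch position. The age bands are invented.
CARD_LAYOUT = {
    "gender": ["M", "F"],
    "age_band": ["0-4", "5-9", "10-19", "20-29", "30-39", "40+"],
}

def punch_card(record):
    """Return the set of (question, answer) positions punched for one person."""
    holes = set()
    for question, answer in record.items():
        if answer is None:
            continue  # unanswered question: no hole in either position
        if answer not in CARD_LAYOUT[question]:
            raise ValueError(f"no punch position for {question}={answer!r}")
        holes.add((question, answer))
    return holes

card = punch_card({"gender": "F", "age_band": "20-29"})
unanswered = punch_card({"gender": None, "age_band": "5-9"})
```

With a single position for gender, the second card would be indistinguishable from a card recording whichever answer was represented by 'no hole'.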

Hollerith developed a machine called a tabulator, into which an operator would insert each card in turn. The machine would automatically operate a number of counters, incrementing them as appropriate for each card. A counter could be associated with a single punch position, thus recording the number of cards with holes in that position—for example, a counter for each age range. But the machine also allowed questions to be cross-tabulated—counters could be set up for combinations of holes. In the first version of the machine, these combinations were pre-set, but subsequently Hollerith developed a plugboard that allowed new combinations to be set up. This effectively made the machine, at least to some degree, programmable.
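In modern terms, the counting logic might look something like the sketch below. The position labels, the `wiring` table and the `tabulate` function are hypothetical names chosen for illustration; the real machine did this with electrical contacts and dials, not software. A counter wired to one position does simple counting, and a counter wired to several positions does cross-tabulation.

```python
from collections import Counter

# Each card is the set of positions punched on it (hypothetical labels).
cards = [
    {"M", "age 20-29"},
    {"F", "age 20-29"},
    {"F", "age 30-39"},
    {"F", "age 20-29"},
]

# The "plugboard": each counter is wired to one or more positions and
# advances only when every one of them is punched on the current card.
wiring = {
    "men": {"M"},
    "women": {"F"},
    "women aged 20-29": {"F", "age 20-29"},
}

def tabulate(cards, wiring):
    counters = Counter()
    for card in cards:                      # the operator feeds one card at a time
        for name, positions in wiring.items():
            if positions <= card:           # all wired positions are punched
                counters[name] += 1
    return counters

totals = tabulate(cards, wiring)
```

Changing the `wiring` dictionary corresponds, loosely, to re-plugging the board: the cards stay the same but a different table comes out.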

The operator places each card in turn in the machine, and the appropriate counters (dials) are incremented, according to the holes punched in the card.

The cards that Hollerith used for this purpose were the same size as the then US banknotes. This was a deliberate choice by Hollerith, partly because it enabled him to get cheap boxes for his cards! Although the size of US banknotes has been reduced since then, the card size remained the same, and is enshrined in an international standard.

## The uses of tabulation

Hollerith’s invention allowed the 1890 US census analysis to be completed in two years, and under budget. It was very quickly taken up for many other national censuses. But it also became clear that it could be used for many other purposes, allowing the analysis of different kinds of data in ways that would not have been feasible when everything was done manually. The invention begat an industry, which became known as the tabulator industry; the analysis of data using tabulators and all the associated machinery became known as data processing.

The system was adopted by insurance companies and others. Already in the mid-1890s, one of the US railroad companies used the Hollerith system to compile statistics relating to freight traffic, and to audit the accounts. Other businesses followed suit. The range of applications included account transactions, payroll accounting, inventory management and billing.

For over half a century, tabulator-based systems were used for a huge variety of data-processing purposes, by a huge variety of companies and organisations. When I started working in the field of information retrieval at the end of the 1960s, there were still companies using tabulator systems for searching scientific databases (in something akin to the way we search the web using a search engine such as Google)—though inevitably by that time they were being replaced by computers.

In the meantime, punched cards had developed a new lease of life as an input mechanism for computers—as we shall see below.

## Unit records

Back in Chapter 6, we discussed the idea of a database. A collection of data of the kind for which tabulators were used, such as a set of census returns, is a kind of database. It’s worth exploring its character in database terms.

In my description above of census data, I talked about records relating to each individual: one punched card represents one person. In this case, we might consider it to be a flat database, that is, one with just a single type of record. However, census data is typically a little more complex than that. In particular, data is normally (at least currently in the UK) actually collected by household, rather than by individual. A single household with a single address quite likely includes more than one individual.

In a modern computer database, the organisation of the data would probably reflect this, by having each individual record linked to a separate household record. So the database would have (at least) two different types of record, together with a linking mechanism. The household record would contain data such as address, and the individual record would give age, gender, etc. Whether or not census data is actually analysed in this way, it is easy to imagine questions one might ask that cut across these record types: for example, ‘tabulate households by the number of school-age children in each’.
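As a rough sketch of such a two-record-type arrangement, here is how the school-age-children question might be answered in Python. The field names, the sample data and the age range of 5 to 16 for 'school age' are all assumptions made for the example, not anything taken from a real census.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Household:
    household_id: int
    address: str

@dataclass
class Individual:
    household_id: int   # the link back to the household record
    age: int
    gender: str

households = [
    Household(1, "1 High St"),
    Household(2, "2 High St"),
    Household(3, "3 High St"),
]
individuals = [
    Individual(1, 40, "F"), Individual(1, 9, "M"), Individual(1, 12, "F"),
    Individual(2, 70, "M"),
    Individual(3, 35, "M"), Individual(3, 6, "F"),
]

def households_by_school_age_children(households, individuals, lo=5, hi=16):
    # Count school-age children per household by following the link...
    children = Counter(p.household_id for p in individuals if lo <= p.age <= hi)
    # ...then tabulate households by that count (zero if none were found).
    return Counter(children[h.household_id] for h in households)

table = households_by_school_age_children(households, individuals)
```

The query needs both record types at once: the ages live on the individual records, while the thing being counted is the household.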

A flat database of unit records of a single type, such as that built on Hollerith punched cards for the 1890 US census, may serve very well for certain kinds of analysis but does have limitations. It would certainly be possible to derive two different sets of unit records from the same basic data: for example, by punching a card representing each household—and keeping these separate from the individual cards. But the linking would be difficult, and some analyses would not be possible.

All the many applications of tabulator technology, from the invention of the Hollerith system until they died out in the 1960s, had to be approached with these limitations in mind. Thus each depended on a choice of a unit as the basis for the unit records. Separate sets of unit records might be created for different kinds of unit, to be analysed separately, but not to be linked subsequently, except by hand.

Despite such limitations, tabulator systems were adopted widely. Hollerith formed a company to exploit his invention. Hollerith’s company merged with three others to form CTR in 1911; in 1924, the merged company was renamed International Business Machines (IBM). In the 1950s and 1960s, IBM redefined itself as a computer company; in the 1980s it developed the PC (not the first personal computer, but one of the most successful commercially). But before that, still in the first quarter of the twentieth century, other rival tabulator companies were formed, and developed in other ways. For example, James Powers, who had taken over the construction of census-analysis equipment within the US Census Bureau, formed the Powers Tabulating Machine Company in 1911. This eventually merged with the typewriter company Remington, which we have already encountered, as Remington Rand.

The development of new business and commercial rivalry went hand-in-hand. JoAnne Yates, in the article ‘Co-Evolution of Information Processing Technology and Use’, has a fascinating discussion of the relationship between the insurance industry and the tabulator companies. Many technical innovations were driven by the demands of the insurance companies, who tended to play off the rival tabulator companies against each other, in order to get the developments they needed.

Some of these particular innovations are:

• the ability to accumulate (sum) numbers encoded on cards, as well as simply counting them (this eventually evolved into full-function arithmetic);
• the ability to sort a deck of cards into categories (Hollerith’s tabulator included a rather simple aid to human sorting, but the insurance companies wanted a much faster process);
• the ability to print, generating reports automatically (Hollerith’s design required a human operator to transcribe the numbers indicated by the counters); and
• the ability to encode, and therefore print, alphabetic characters (e.g. names and addresses) as well as numeric ones.

In the third case, for instance, the Powers company had a working printing device considerably before IBM. However, IBM would eventually catch up with a good product, and its considerably greater commercial muscle enabled it to dominate the industry for many years. The last case was initiated when one of the UK insurance companies took over the UK arm of the Powers company.

Naturally enough, the first (and virtually all subsequent) alphanumeric keypunches used the Sholes QWERTY keyboard with which we are already familiar. And indeed, in 1933, IBM bought up one of the earliest electric typewriter companies, Electromatic, and became a major player in the typewriter business, eventually coming to dominate the market in high-end office electric typewriters, which effectively carried it over the transition to computers.

## The transition

The continuity between the tabulator-based data-processing world and the computing world is remarkably strong. For one thing, at least as far as IBM computers were concerned, punched cards were used for data input, both for the data to be analysed and for program instructions. This meant that the keypunches and the card feeders and the card readers, as well as the cards themselves, could be repurposed, and trained keypunch operators could be reassigned.

But even outside the IBM environment, the conception of the business uses to which a computer might be put started from what data processing had done before computers came along. Of course companies innovate, devise new ways of using the resources that they have, including computers. But these innovations seldom come out of thin air, and often involve doing something that you are doing already, but in a better or more flexible or more reliable way. And besides, those functions that are most susceptible to mechanisation using tabulators are also those functions that are most obviously computerisable.

In the late 1940s, after the end of the Second World War, academics took up the idea of developing computers, some sense of what might be possible having survived the total destruction of Bletchley Park (see the next chapter). They of course had many ideas about what might be done with computers and to what uses they might be put, many of them revolving around calculation, since the people who worked on the computers were often mathematicians, physicists and engineers. Applications included astronomical calculations, weather forecasting, mathematical modelling of engineering structures or chemical processes, and so on.

But as a business, operating in the world of business, computing could only take off by starting somewhere that other businesses might recognise as useful. That starting place was the existing tabulator industry and its existing applications. As we have seen, these applications were not primarily about calculation. They were about sorting, storing, manipulating and managing records.

## Data

I have talked about databases, but it is worth thinking a little more about data. In the Jacquard loom, a hole in a card represents a control mechanism for a machine, and it is really only in hindsight that it looks like a coded representation of the woven pattern. Babbage’s notion (which of course was never made concrete) was that such holes could encode not only numerical data (to be operated upon), but also the arithmetic operations that might be applied to them, and indeed therefore the sequence of operations and the rules for branching, which we would now call the program.

Such abstract considerations were probably far from Hollerith’s mind when he developed his system, but he did have a notion that they could encode characteristics of people. One of his sources of inspiration was a system in use on the railways, where the conductor would punch each ticket with codes to identify the passenger. Another source was the piano roll, a continuous roll of paper in which slots were cut to represent and reproduce notes played on the piano, an almost-digital representation of music (it’s not properly digital—the length of each slot is an analogue representation of the time for which each note is held). Thus there is the beginning of an idea that there is some kind of abstraction that we might call data.

Over the next half century or more, this notion will develop slowly. In particular, the idea is centred around the association with characters that may be typed or printed. And we know about characters: we have letters and digits. Oh, and special symbols and punctuation marks. Oh, and we discover that we need to extend the notion a little, to include first spaces, then carriage returns and line feeds (the ability to print a three-line address from an IBM card came really late in the day). In fact, basically anything that can be a key on a typewriter: as typewriters were electrified, they were also given a key for carriage return, as opposed to the lever on the left-hand end of the carriage, which was the norm for manual typewriters.

Despite Morse and more than a century of telecommunications, it was not until the 1960s, well into the computing era, that the idea would emerge that we might need a standard form of encoding for all these characters, for all the different data processing and telecommunications purposes. Then two different standards appeared: ASCII, which was explicitly designed to be a standard, and EBCDIC, which became a de facto standard because of its use by IBM (we encountered these in Chapter 5). Both standards take as their basis the kinds of characters that one sees on a typewriter keyboard.
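The difference between the two standards is easy to see with a modern language that still supports both. The sketch below uses Python's built-in `ascii` codec and `cp037`, which is just one of several EBCDIC code pages (choosing it here is my assumption; IBM used many variants). The sample text is invented.

```python
text = "AGE 42"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")   # one EBCDIC variant among several

# The same six characters, stored as quite different byte values.
print(ascii_bytes.hex(" "))    # 41 47 45 20 34 32
print(ebcdic_bytes.hex(" "))   # c1 c7 c5 40 f4 f2

# Decoding with the matching code page recovers the original characters.
round_trip = ebcdic_bytes.decode("cp037")
```

The characters are identical on the keyboard and on the printed page; only the internal codes differ, which is why data moved between the two worlds has to be translated.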

As with numbers, the internal storage and manipulation of such data in computers does not necessarily retain the direct link to printable characters. We have already seen that numbers may be converted and then stored and manipulated in a number of different ways—the same applies to other kinds of data. This was already true of the original Hollerith card—the gender question is represented by two hole-positions called M and F, rather than by the words Male and Female. But many other such transformations may be made inside a computer, even if the external representation (input from a keyboard, or output to a printer or a display) is always made using the words with which we are familiar.

Now, finally, we can think of all text and all numbers as data, and in particular as digital data, storable, transmittable and processable in a large variety of ways, but essentially all by the same kinds of machines. It will take a little longer for the idea of data to encompass images and music as well, but that will happen eventually, as we saw in Chapter 7. Furthermore, not only can we think of text and numbers and photographs as data, in some sense we have to think of them this way. As with so many other influences of technology, there is no going back.