Integrated Census Microdata (I-CeM)
Data limitations
This page introduces researchers to sources of inaccuracy in I-CeM data arising from the enumeration and transcription process and to additional information to consider when using variables related to birthplaces, occupations, household structures, and consistent geography.
Problems with census data
The General Register Office (GRO) and General Register Office of Scotland (GROS) had comparatively little time to organise the taking of the census, and some of the agents involved left much to be desired - illiterate householders, slap-dash enumerators, and registrars who did not supervise the work properly. This alerts us to the problematical nature of some of the data in the manuscript returns. The information in the enumerators' books was several stages removed from reality, and each stage could add its own accumulation of errors.
The household schedules that form the 1911 and 1921 returns in England and Wales may be closer to 'raw' data but still contain inaccuracies, for instance where householders were less than assiduous in determining details of their servants or simply misunderstood the form. Even Sylvanus Percival Vivian, the Registrar General in 1921, initially managed to place details of his daughter's education in the wrong column, despite having spent months agonising over which questions to ask and how to make the instructions as clear as possible.
The forms could be confusing, particularly when the GRO attempted to innovate. In the 1921 census, for example, a new question required heads of households to put crosses in boxes in the right-hand column of the schedule indicating the ages of their resident or non-resident children aged less than 15. This caused some heads to erroneously enter crosses on the rows pertaining to resident children, and/or provide details for non-resident children (subsequently crossed out again by the enumerator). The situation was further complicated by the census office not allowing for twins or children born within 12 months of one another (and who, therefore, might be the same age in years) when designing the census schedule for 1921.
Not only did householders or enumerators guess at what was asked of them but also at the answers themselves. For example, there appears to be a tendency for stated ages to bunch around 10s and 5s – people knew they were about 50, or in their 50s. Thomas Large of Windsor claimed to be 70 in both 1891 and 1901. Similarly, birthplaces can be inconsistent for the same person between enumerations, or be idiosyncratic. In an age before most people had birth certificates this is perhaps understandable, but the overall effects do not seem to be excessive. Enumerators may not have always understood what they were told by illiterate, and perhaps suspicious, householders. We do not know the full extent of the errors or omissions made by the enumerators in the process of copying the household schedules into their books. There are, for example, some cases of enumerators entering families twice, and no doubt others were missed out.
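The bunching of stated ages on figures ending in 0 and 5 described above can be quantified before an age variable is used in analysis. The sketch below computes Whipple's index, a standard demographic measure of age heaping; it is offered only as an illustration and is not part of the I-CeM processing. The function name and the plain list of ages are assumptions made for the example.

```python
def whipples_index(ages):
    """Whipple's index for a list of stated ages in whole years.

    Only ages 23 to 62 are considered. A value of 100 means no
    preference for ages ending in 0 or 5; 500 means every stated
    age in the range ends in 0 or 5.
    """
    in_range = [a for a in ages if 23 <= a <= 62]
    if not in_range:
        raise ValueError("no ages between 23 and 62 to assess")
    # Count stated ages that are multiples of five (ending in 0 or 5).
    heaped = sum(1 for a in in_range if a % 5 == 0)
    return 100.0 * 5 * heaped / len(in_range)


# One person at every age from 23 to 62: no heaping at all.
even_spread = whipples_index(list(range(23, 63)))

# Everyone reporting a round age: maximum heaping.
all_round = whipples_index([30, 40, 50])
```

Values well above 100 for a parish or district suggest that stated ages there were often approximations and may be better analysed in grouped form.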
Some householders may have been reluctant to give embarrassing information regarding the mental disabilities of their kin, especially children. Moreover, the man who described himself as 'Feeble-minded since marriage' may have been less than truthful. Exactly how visually impaired did one have to be to be 'blind', and what if one were 'deaf' but not 'dumb'? The disability question was omitted entirely in 1921 due to the 'generally recognised fact that reliable information upon these subjects cannot be expected in returns made by or on behalf of the individuals afflicted'.[1]
There has also been much written about the occupational and employment data given in the census returns, and there were probably problems with some casual and seasonal work. After all, the census recorded occupational titles – what people called themselves – rather than asking for an itemised list of labour inputs. This may have been particularly significant for the work of some women and children, and in agriculture. It is not the job of the I-CeM team to state how reliable or complete the recording of occupational information within the censuses might be for any specific group. These are questions of interpretation and it is up to the users of the data to make their own informed judgements. In doing so, we do, however, point users to previous research on the subject as recorded in the Publications section of this website.
It should not be overlooked that attempting to count an entire nation in a single night produced serious and inevitable errors which were impossible to resolve contemporaneously and still probably cannot be fully remedied. While the GRO went to great lengths to ensure that enumeration took place on a day on which the majority of the population would be at home, there were always individuals enumerated elsewhere. Such individuals cannot in most instances be placed back in their "proper" parish of residence, particularly if at sea. This is a significant issue with the 1921 censuses. As early as 1920, administrators had been particularly anxious about the possibility of 'extensive movement of the population', taking care not to arrange the enumeration too close to Easter Sunday.[2] However, coal strikes which began in April 1921 caused the census to be delayed until late June. Though this was before the beginning of the usual 'industrial holiday' season in July, 'unusually fine and warm weather' in June meant that towns such as Blackpool were swollen by holidaymakers; Blackpool's enumerated population was 64% higher than in 1911.[3]
Occasionally, householders simply refused to fill in the census. Subsequent fines and court orders remedied some missing entries though others will inevitably have slipped through the cracks of bureaucracy. In most years, such losses will be trifling. However, in 1911 there was widespread censal disobedience by supporters of women's suffrage after the Women's Freedom League called for a boycott. Hundreds of women (and smaller numbers of men) were thus simply never counted, were counted against their will (enumerators simply guessing at numbers of residents), or refused to give full and accurate details of themselves.
There have also, inevitably, been some losses amongst the original returns, which have not always been held in optimum conditions. The backs and fronts of some of the enumerators' books have been damaged by storage on unsuitable racking, and in some cases there has been more serious damage and loss. But some of these gaps are being made good, as in the case of the 1851 census returns for Manchester, Salford, Oldham and Ashton-under-Lyne, which were severely damaged by flooding. These have been patiently transcribed by the Manchester and Lancashire Family History Society, and the new transcriptions are incorporated into the I-CeM data collection.
In short, users of I-CeM should be aware that all historical sources have their limitations and biases, and it is the job of the skilled researcher to take these into account when using them. I-CeM is no different in this regard.
Birthplaces
Coding parishes of birth is far from straightforward (see dataset construction). Because of the way in which this information was enumerated, it is prone to error and cannot be standardised correctly without detailed local knowledge, which is not feasible in a project of this nature. The variables associated with the standardisation of birthplace should therefore be treated with caution, as mistakes will undoubtedly have occurred. However, it is probable that the variables are correct for some 95 per cent of the individual records processed.
The census question about birthplaces required differing responses, depending upon individual circumstances. In England and Wales in 1851 in the case of those born in England and Wales, householders were to indicate first the county, and then the town or parish of birth. This order was to be followed in all subsequent Victorian censuses. In the case of those born in Scotland, Ireland, the British Colonies, the East Indies or Foreign Parts, the country of birth was to be stated. The term 'British Subject' was to be added to the latter where appropriate. Interestingly, Wales was not mentioned in the instructions on this matter until 1891, when the principality was treated in the same manner as England. Some other minor changes were introduced in the course of the century. In 1861 a distinction was to be made between 'British Subject' and 'Naturalised British Subject'. In 1871 those born in Scotland, Ireland, the British Colonies or the East Indies were to state the country or colony of birth; and those born in Foreign Parts the particular state or country. The 1901 census broke the population down into four groups in the following manner: State the Birthplace of each person…
- If in England and Wales, the County and Town, or Parish.
- If in Scotland or Ireland, the name of the County.
- If in a British Colony or Dependency, the name of the Colony or Dependency.
- If in a Foreign Country, the name of the Country, and whether the person be a 'British Subject', a 'Naturalised British Subject', or a 'Foreign Subject' specifying nationality such as 'French', 'German', &c.
In England and Wales in 1911 those born in the United Kingdom were required to provide the name of the County, and Town or Parish of birth. Those born in any other part of the British Empire were to provide the name of the Dependency, Colony, etc., and of the Province or State. Those born in a Foreign Country were required to write the name of the Country. For those born at sea, the required response was "At Sea". The census of 1921 was similar in many respects.
The Scottish census returns were very similar, although substituting Scotland for England and Wales in the above rules.
Occupations
The I-CeM Occupational Matrix (provided in the Metadata section of this website) gives the class, order and sub-order in which occupations in these groupings can be found in each of the published Census Reports for England, Wales and Scotland.
Due to the sheer volume of unique occupational strings in the database, the vast majority have had to be coded automatically. Whilst every possible attempt has been made to ensure the accuracy of the OCCODE variable, some will undoubtedly have been mis-coded. Others, of course, especially for the censuses of 1851 to 1901, could potentially be assigned to one of several plausible codes due to the incompleteness or ambiguity of the occupation string from which the code is derived. It is estimated that OCCODE is 'correct' for at least 95 per cent of individuals with a designated occupation title. It is also important to realise that, pragmatically, all occupation strings had to be coded in isolation from other information relating to the household to which the individuals bearing any given occupational string belonged. Thus, with the exception of 1911 and 1921, where the associated Hollerith code affects the assigned OCCODE, all identical occupational strings were coded identically regardless of the age or gender of the individual in question, or where they were enumerated. This matters when considering occupational titles whose meaning, in terms of what individuals actually did, differs according to context. As an example, all those recorded simply as 'FIREMAN' would have been assigned (out of necessity) to the generic code 766, even though firemen could equally have been local authority workers, employed in mines, on railways, or in the potteries, as well as in multiple other roles or industries.
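The consequence of coding strings in isolation can be shown with a small sketch. This is not the I-CeM coding program: the lookup table, the function name and the normalisation step (upper-casing and collapsing spaces) are assumptions made for illustration, and only the pairing of 'FIREMAN' with code 766 comes from the text above.

```python
# Hypothetical lookup table; only FIREMAN -> 766 is taken from the text.
OCCODE_LOOKUP = {"FIREMAN": 766}


def assign_occode(occupation_string, lookup=OCCODE_LOOKUP):
    """Return the code for an occupation string, or None if it is unknown.

    The code depends on the string alone: age, gender and place of
    enumeration are never consulted.
    """
    # Assumed normalisation: upper-case and collapse runs of whitespace.
    key = " ".join(occupation_string.upper().split())
    return lookup.get(key)


# Two invented records whose context suggests very different work.
records = [
    {"occ": "Fireman", "age": 34, "place": "colliery village"},
    {"occ": "FIREMAN", "age": 19, "place": "railway town"},
]
codes = [assign_occode(r["occ"]) for r in records]
```

Both records receive 766. Researchers who need to separate, say, railway firemen from colliery firemen therefore have to bring in contextual information themselves.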
The I-CeM dataset also uses the Historical International Standard Classification of Occupations (HISCO) - an historically sensitive and internationally valid occupational classification allowing researchers from a variety of countries to communicate with each other and make international comparisons across the nineteenth and twentieth centuries in social, economic and other fields of history. The classification scheme is hierarchical, in the sense that each digit in the 5-digit codes introduces a new level of detail. Codes sharing the same first 1, 2 or 3 digits are considered to be increasingly similar. For example, all people working in agriculture have the first digit 6. The first digit of a code indicates the "Major group" a person's occupation is in. The second digit indicates a "Minor group" distinction. Continuing the previous example, people who have the first two digits "61" are farmers - who may specify what they are cultivating or tending - and farm managers. Thus, as well as sharing the characteristic of working in agriculture they also share the characteristic of being owners or managers. The first 3 digits denote the "Unit group" of an occupation, which introduces more detail. For example, the unit group "612" indicates "Specialized farmers". Within this unit group, 4th and 5th digit distinctions known as "titles" or "headings" are made. For example, "61220" indicates "Field crop farmers" and "61230" indicates "Orchardists and fruit farmers".
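Because the hierarchy is carried by the leading digits, the levels described above can be recovered by slicing the code. The following is a minimal sketch; the function name is hypothetical, and it assumes codes are held as 5-character digit strings (or as integers that have lost their leading zeros).

```python
def split_hisco(code):
    """Split a 5-digit HISCO code into its hierarchical levels.

    Integers are left-padded with zeros, since storing codes as numbers
    drops leading zeros. Anything that is not five digits after padding
    is rejected.
    """
    text = str(code).zfill(5)
    if len(text) != 5 or not text.isdigit():
        raise ValueError(f"not a 5-digit HISCO code: {code!r}")
    return {
        "major_group": text[:1],  # first digit
        "minor_group": text[:2],  # first two digits
        "unit_group": text[:3],   # first three digits
        "title": text,            # full five-digit code
    }


field_crop = split_hisco("61220")
orchardist = split_hisco(61230)

# Both titles fall in the same unit group, "612".
same_unit_group = field_crop["unit_group"] == orchardist["unit_group"]
```

Grouping records on `minor_group` or `unit_group` rather than on the full code is a simple way to aggregate occupations at a coarser level.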
The Hollerith occupation variable provides the Registrar General code for the industry or service with which a worker was connected, as transcribed from the 1911 and 1921 schedules. In 1911, industry codes were assigned by the clerks of the Registrar General Office and marked on the schedule in preparation for keying Hollerith punch cards for tabulation purposes, using punch codes for industries which are identical to the punch codes for occupations. Industry codes were only assigned by clerks in those cases where the occupation and industry categories differed. For example, the given occupation of "boot-maker" working in a "boot makers" would be occupation code 300 or "Boot, Shoe-Maker" (see HOLLEROCC) with no industry code, whereas the given occupation of "errand boy" working in a "boot makers" would have been assigned occupation code 090 for "Messenger, Porter, Watchman (not Railway or Government)" and industry code 300 for "Boot, Shoe-Maker".
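For 1911, the rule that an industry code was written only where it differed from the occupation category implies that a blank industry code can be read as "same category as the occupation". The sketch below makes that inference explicit. It is an interpretation of the paragraph above rather than a documented I-CeM procedure; the function and parameter names are hypothetical, and it should not be applied to 1921, where industries had their own code list. Codes are kept as strings so that leading zeros (as in 090) survive.

```python
def effective_industry_1911(occupation_code, industry_code):
    """Return the industry punch code implied for a 1911 record.

    Assumption: a missing or empty industry code means the clerks judged
    the industry category to be the same as the occupation category.
    """
    return industry_code if industry_code else occupation_code


# Boot-maker working in a boot makers: occupation 300, no industry code.
boot_maker = effective_industry_1911("300", None)

# Errand boy working in a boot makers: occupation 090, industry 300.
errand_boy = effective_industry_1911("090", "300")
```

Under this reading both workers are attributed to industry 300, "Boot, Shoe-Maker", although only the second has an industry code on the schedule.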
However, in 1921 industry or, more accurately, employer codes were assigned to all those in work, as householders were required to list not only their profession but also the name of their employer. Industries were given unique codes; these were not incomparable with the occupational codes (e.g. 010 related to agriculture in both) but were far from identical (700 referred to Railway Officials in occupations but to the Navy in industries). (See the 1921 Classification of Occupations and the 1921 Classification of Industries for reference.) These codes have not had any correction applied, nor have they been enriched in I-CeM.
It is also important to note that the Hollerith occupation codes were completely reassigned before the 1921 Census. Here the census data are particularly affected by the internal politics of the GRO, somewhat separate from issues of practice in collection or survival. In 1919 the GRO (along with the rest of the Local Government Board) was merged with the National Health Insurance Commission (NHIC) under the Ministry of Health (MoH). The new permanent secretary, Robert Morant (formerly chairman of the NHIC), isolated the RG, Bernard Mallet, through the appointment of a deputy RG, Sylvanus Percival Vivian. Eventually Mallet resigned and Vivian was swiftly appointed as his replacement. Those associated with the "old regime" were treated with suspicion, none more so (not necessarily without merit) than Thomas Henry Craig Stevenson, who had been responsible for the 1911 occupational coding system. Vivian instead handed responsibility for determining occupational coding for the 1921 census to Eric Waldemar Sorensen (1881-1930), a second division clerk, who had at least worked on the 1911 codes. Stevenson's codes were far from perfect, he himself finding fault in the social class metric he derived from them. The whole process was arguably influenced more by intellectual curiosity (potentially including Mallet's eugenicist interests) than the practical needs of departments such as the MoH. However, the extent to which the codes were revised due to a desire to produce more useful data for the operation of the state or simply to exert the dominance of the MoH, ridding it of Mallet and Stevenson's lingering influence, is unclear.
Household structure
The key to understanding the Hammel/Laslett classification scheme is the conjugal family unit. These are formed in one of three ways:
- by married couples without offspring;
- by a married couple with never-married offspring and/or never-married adopted/foster children;
- by a lone parent with at least one never-married child.
Examples:
In the three following examples, the outer rectangular box represents the household, while the inner curvy line denotes a conjugal family unit. Conjugal family units are not classified in the programs, but their numbers are recorded in the variable CFU.
If there are more than two generations in the household, the conjugal family unit is formed from the youngest generation upwards. No individual can be in more than one conjugal family unit.

The above household of head (male) and spouse, with three children is of type 320.

The above household of a widowed woman, her daughter, son-in-law and grandchild has an HHD of type 410. Note that for the widow the variable CFU will have a value of 0 while that of the other three will have a value of 1. The variable CFUSIZE will have a value of 0 for the widow and of 3 for the other three members of the household.
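The relationship between CFU and CFUSIZE in this example can be reproduced with a short sketch. It assumes, as the example suggests, that CFU holds the number of the conjugal family unit an individual belongs to within the household, with 0 meaning none; the function name and the way the household is represented are hypothetical, and this is not the I-CeM classification program.

```python
from collections import Counter


def cfu_sizes(cfu_values):
    """Return a CFUSIZE-style value for each household member.

    Members with CFU 0 belong to no conjugal family unit and get size 0;
    everyone else gets the number of members sharing their CFU number.
    """
    counts = Counter(value for value in cfu_values if value != 0)
    return [counts[value] if value != 0 else 0 for value in cfu_values]


# The widow's household: (relationship, CFU number).
household = [
    ("widow (head)", 0),
    ("daughter", 1),
    ("son-in-law", 1),
    ("grandchild", 1),
]
sizes = cfu_sizes([cfu for _, cfu in household])
```

Here `sizes` is `[0, 3, 3, 3]`, matching the values given above for the widow and the three members of the younger conjugal family unit.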

The final household shown above, containing two co-residing families which are linked by a pair of siblings with no parent present, is known as a frérèche and would receive an HHD of type 540.
Geography
The administrative geography of Britain underwent near-constant revision. Not only were Registration Districts sometimes re-drawn but parishes, some of which had existed for centuries, were re-arranged, merged, or split into completely new units. What might have been a clearly identifiable parish in 1851 could be unrecognisable by 1901, let alone 1921. To enable researchers to compare consistent "places" over time, a CONPARID variable has been painstakingly developed over a number of years. This began for Scotland with the work of Professor Michael Anderson and for England and Wales with that of the late Sir E. A. Wrigley up to 1911. Most recently, in version 2 of I-CeM, the consistent geography has been extended by Wakelam and Schürer to include the 1921 data as well as to link Wrigley's previously separate 1851-91 and 1901-11 CONPARIDs (as existing in version 1 of I-CeM) and to correct a number of errors.
This variable amalgamates parishes where necessary so that the geographical territory under consideration remains constant over time. So, for example, assume that part of parish A was transferred to parish B between census years. In order to create a consistent geographical unit over time one would need to treat them, not as separate parishes, but as a single entity throughout the whole period. As such, users are comparing "like" with "like" over time. As parish boundaries shifted, this leads to ever larger consistent units (and a smaller total number of units); indeed significant changes in the treatment of urban areas by 1921 forced the creation of substantial consistent units such as 'Gloucester and surrounding parishes'. However, the consistent geography has never overridden an individual year's censal geography, i.e. each PARID has been maintained. CONPARID is most importantly deployed as the basis of the new download tool released with I-CeM 1921, allowing users for the first time to download data from common places across multiple years of census data. Users interested in specific parishes can filter out irrelevant data based on the downloaded PARID for each year, and those looking for parishes not immediately evident in the downloader can refer to the CONPARID lookup tables, particularly to identify smaller units which ceased to exist during the period or suburbs subsumed by larger towns.
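The amalgamation logic in the parish A and parish B example can be sketched as a grouping of parishes that are linked, directly or through a chain, by boundary transfers. The code below uses a simple union-find structure to do this. It illustrates the principle only and is not how CONPARID was actually built, which involved years of manual work; the function name and the parish labels are hypothetical.

```python
def consistent_units(parishes, transfers):
    """Group parishes linked by boundary transfers into consistent units.

    parishes:  iterable of parish labels
    transfers: iterable of (from_parish, to_parish) pairs, one for each
               transfer of territory between census years
    Returns a sorted list of sorted groups.
    """
    # Include any parish that appears only in a transfer pair.
    labels = set(parishes) | {p for pair in transfers for p in pair}
    parent = {p: p for p in labels}

    def find(p):
        # Follow parent links to the group's representative,
        # shortening the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in transfers:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for p in labels:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(group) for group in groups.values())


# Part of A goes to B, and later part of B goes to C; D never changes.
units = consistent_units(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```

In this example A, B and C collapse into one consistent unit while D stands alone, which shows why each additional boundary change tends to produce larger and fewer units.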
Due to the nature of its creation, CONPARID will contain errors, though it is thought from manual checking that these will be few and minor. These are most likely to occur in rural areas, particularly those in which fen drainage occurred in the 1860s and where there may be numerous parishes with similar names. The I-CeM team admit that human errors which have eluded numerous checks may be particularly obvious to native Welsh speakers. Most importantly, it is clear that occasionally the older consistent geographies on which this variable is based joined small (in terms of population but also place) nearby but non-contiguous parishes together for simplicity. Every effort has been made to remove these or to include neighbouring parishes to ensure contiguity, but some likely remain.
PARID is the 'parish' unit listed in the various tables published year by year in the GRO and GRO(S) Census Reports. It is, therefore, not consistent over time. Equally, the same named parish in different years may not cover the same geographical territory, due to boundary changes over time.
[1] GRO, Census of England and Wales 1921 - General Report with Appendices (London: 1927), 2.
The Cambridge Group for the History of Population and Social Structure