Integrated Census Microdata (I-CeM)
Dataset construction
The I-CeM data collection results from a number of interlinked stages of construction. Since it is important that users of the data understand the processes which have been undertaken in order to create I-CeM, these are described in outline below.
Enrichment variables and households
Reconciliation
In order for researchers to conduct spatial analysis (analysis of variables by place), observed individuals within the analysis need to be assigned to a particular place.
It was first necessary to establish an 'enumeration geography' for each census year drawing on the published population tables in the decadal Census Reports. The result is a hierarchical geography consisting of Registration Districts (RD), Registration Sub-Districts (RSD, except in Scotland) and civil 'parishes'. Users new to the census should be aware that whilst any given RD or RSD might bear the same name over time, it may not necessarily retain the same area due to fluctuations in population size and density over the period. Likewise, 'parishes' in I-CeM (as identified by the PARID variable) do not necessarily relate directly to civil parishes. Particularly in urban areas, some large civil parishes might be split across several RDs or RSDs, with each split being allocated a separate PARID in order that the RDs and RSDs in question can be reconstructed, if required.
Attempts to mitigate limitations related to consistent geography between censuses are detailed in the limitations section, particularly the process of assigning a CONPARID variable. Users should also refer to the detailed discussion of CONPARID available in the case studies section of this website.
Individuals within the raw I-CeM data were subsequently allocated to their correct 'parish' of enumeration. This was done by comparing the 'observed' (I-CeM) population to the 'expected' (official Registrar General census report) population. While in theory matching the enumerated census data with the published tables ought to be a simple process, in practice it proved hugely time-consuming and problematic.
In order to match the data to the published reports, since the information held at the enumeration district level was often insufficient to identify the administrative census district and parish, the geographic data recorded at the top of each census page (for 1851-1901) had to be used to reconcile the data to the published returns, page by page. Initially a program produced a summary of the information recorded for each page, adding up the number of individuals by page and by enumeration district where necessary. This page-level information was linked automatically using a combination of the geographical information recorded on the page headings to a specially created machine-readable version of the published parish population totals, year by year, for each country of enumeration. This enabled the re-composition of the parish, and calculated 'observed' populations from the raw data to be compared to 'expected' populations as recorded in the published tables. The raw data was considered to be reconciled to the published figures whenever there was a match on administrative geographic identifiers and population totals in both sets of data.
The raw data was not always correct with respect to 'place' because:
- the combination of registration district, registration sub district, and parish in the raw data was not valid (e.g., variations in spelling or spelling mistakes);
- the wrong combination of registration district, registration sub district, and parish had been recorded in the raw data;
- raw data are missing, or 'wanting'.
Concerning the last of these, it is important for users to realise that not all census records have survived. This is especially true for 1851 and 1861 for which whole parishes or even whole RSDs are 'wanting'. Further details on missing data are available from both The National Archives and FindMyPast websites.
As part of the reconciliation process, discrepancies of less than 0.5% of the expected population were ignored, and other discrepancies made good, as far as was possible, by the time-consuming process of manually reallocating pages from parishes in 'excess' to those in 'deficit', and vice versa, in order to produce the best possible fit. It was assumed that the lowest level unit of reconciliation was the census enumeration book page because each page of entries should (in theory) be within one administrative parish. However, this was discovered not always to be the case.
It is important to note that whilst every effort has been made to link as accurately as possible the 'raw' census as received from FindMyPast to the administrative geography as set out in the various published returns, the outcome is not, and indeed cannot be, totally accurate. This is due to a host of problems with the underlying nature of the way administrative geographical information is recorded on the pages of the Census Enumerators' Books (CEBs).
As mentioned above, a significant factor in these discrepancies is that enumeration books, or parts of books, are missing from the raw data. It must be remembered that the archival record is itself incomplete. This is true of all years, but especially in the case of the English and Welsh Census of 1861. Equally, the transcriptions made available to the project do not in all cases include 'ships' enumerated both in coastal waters and in ports. This, again, is because these have not always survived, or were not transcribed. This may well explain the differences between 'observed' and 'expected' population totals for coastal parishes.
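The comparison of 'observed' and 'expected' populations described above can be illustrated with a minimal sketch. It is not the project's own program: the function names, the page counts, and the parish identifiers are hypothetical, and only the 0.5% tolerance is taken from the text.

```python
from collections import defaultdict

TOLERANCE = 0.005  # discrepancies below 0.5% of the expected population are ignored


def observed_totals(pages):
    """Sum page-level person counts into parish-level 'observed' populations."""
    totals = defaultdict(int)
    for parish_id, persons_on_page in pages:
        totals[parish_id] += persons_on_page
    return dict(totals)


def reconcile(observed, expected, tolerance=TOLERANCE):
    """Classify each parish as 'reconciled', 'excess' or 'deficit'."""
    status = {}
    for parish_id, expected_pop in expected.items():
        diff = observed.get(parish_id, 0) - expected_pop
        if diff == 0 or abs(diff) < tolerance * expected_pop:
            status[parish_id] = "reconciled"
        elif diff > 0:
            status[parish_id] = "excess"
        else:
            status[parish_id] = "deficit"
    return status


# Hypothetical page summaries: (parish identifier, persons on page)
pages = [("A", 25), ("A", 25), ("A", 24), ("B", 25), ("B", 20), ("C", 999)]
expected = {"A": 49, "B": 70, "C": 1000}
status = reconcile(observed_totals(pages), expected)
```

In this invented example parish A is in 'excess' and parish B in 'deficit', which is the situation in which pages would be candidates for manual reallocation from one to the other.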
In the case of the 1921 census of England and Wales, the task of reconciliation was made more problematic by two additional, unrelated problems.
First, in all census years enumeration districts were determined as much by population as by geography, given that each represented the work of delivering and collecting schedules in a day deemed feasible for a single Census Enumerator. The period between the 1911 and 1921 censuses was one of significant reform in the administrative geography of Britain, some of this occurring simply at the level of the GRO (including the merging of small rural parishes with a single enumeration district (ED) for official convenience) and some at a legislative level through the creation of new towns. Determining exactly where such splits occurred within an ED usually required manual checking and was extremely time-consuming.
Second, in transcribing the data TNA/FMP decided to include several thousand 'null' records. These are the records of individuals who are recorded in a household despite not being present on census night (maybe on holiday given the postponement of the census to June) as well as hundreds of pets (such as 'Humourist' the Tortoise in West Ham, assigned the occupation of 'Eats Slugs'). Such entries were dutifully crossed out by the enumerators, but despite this the information was transcribed without any marker to indicate that this was an invalid record. Trying to identify such records has not been without problems, and it is clear that several will still exist within the data.
Note: since the 1851-1911 data were first deposited with the UKDS, a number of modifications have been made to the initial parish reconciliation allocation and new PARIDs assigned. This includes the identification of empty or null records in the data and duplicated records. Look-up tables providing details of these records are available from the MetaData section of this website. In addition, resulting from the ESRC-funded British Business Census of Entrepreneurs project headed by Professor R. Bennett at the University of Cambridge, which drew heavily on the I-CeM data, a paper has been produced which details gaps in the I-CeM data and proposes weights to adjust for these.
Standardisation
Standardisation is the major element of the work in creating I-CeM. The original census enumerators' books (and schedules in 1911 and 1921, England and Wales) are essentially textual documents. Most of the information contained within them takes the form of textual strings. Due to variance in the ways in which the same information is recorded textually (e.g. Bricklayer, Brick Layer, Bricklaying etc.), in their raw form these strings are almost impossible to analyse comprehensively unless some form of standardisation is undertaken. In tackling census transcriptions covering the entire country, this trouble becomes immense. The following table of data from the censuses of 1851-1911 illustrates this:
| Variable | n. of unique strings | % of strings with frequency of 1 | % of strings with frequency of 5+ |
| Relationship to household head | 95,479 | 60.3 | 18.0 |
| Marital status | 7,822 | 60.7 | 17.3 |
| Disability | 59,204 | 73.3 | 10.2 |
| Language (1891-1911 only) | 3,094+ | 1.0 | 62.7 |
| Building type (1911 E&W only) | 71,581 | 4.6 | 60.0 |
| Employment status (1891-1911 only) | 66,379 | 29.3 | 51.5 |
| Occupation | 7,304,708 | 77.7 | 7.2 |
| Birthplace | 6,703,779 | 70.2 | 10.6 |
+ Note: Language entries for Scotland were mixed in the transcription with the birthplace fields, so this figure underrepresents the true figure.
Not only are there many strings to standardise, in the case of occupation and birthplace running into several million (the transcribed version of the 1921 census of England and Wales alone has some 5.5 million unique occupation strings and 4 million birthplace strings), but a high proportion of these were low frequency and thus prone to be rather idiosyncratic in nature, making the task of coding all the more difficult.
The solution differed across the variables or strings in question. Several questions, such as marital status, relationship to head of household, or sex, produced only a (relatively) small number of variants and could be easily coded manually. Additionally, many of the simpler questions were asked repeatedly, enabling data dictionaries to be re-used and to grow for each subsequently coded year (beginning with 1851 and ending with the 1921 Census).
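The three statistics reported in the table can be computed for any transcribed variable along the following lines. This is a sketch only: it assumes the percentages are taken over unique strings rather than over records, and the sample values are invented.

```python
from collections import Counter


def string_profile(values):
    """Return (unique strings, % with frequency 1, % with frequency 5+)."""
    counts = Counter(values)
    n_unique = len(counts)
    if n_unique == 0:
        return 0, 0.0, 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    frequent = sum(1 for c in counts.values() if c >= 5)
    return n_unique, 100 * singletons / n_unique, 100 * frequent / n_unique


# Invented sample of occupation strings
sample = ["Bricklayer"] * 5 + ["Brick Layer"] * 2 + ["Bricklaying", "Brick layr"]
profile = string_profile(sample)
```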
Full manual coding for variables such as occupation or birthplace was not an option due to sheer volume. For this a set of automatic/semi-automatic processes for coding had to be devised.
Occupations
Occupations were coded to a bespoke system aligned with contemporary categorisations to provide standardisation and enable comparison with the nineteenth and early twentieth century Registrar General publications.
OCCODE is rooted in the coding system developed for the 1911 Census. Not only was 1911 the first year for which schedules rather than CEBs were available to the project, but it was also the first time in the UK that Herman Hollerith's coding system (previously used extensively in US censuses) was used, the punch card codes being written by census officials onto the schedule. Occupations cited in 1881 were mapped to the 1911 coding system; remaining occupation strings were then assigned a code algorithmically. This took the form of a number of stages. Initially strings were 'cleaned' by removing non A to Z characters and compared as whole strings to the stock of coded strings using a number of different techniques, including phonetic based comparisons, K-approximate string-matching algorithms, and general editing distance comparisons, such as Levenshtein and Jaro-Winkler distances.
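To give a flavour of the whole-string comparison stage, the sketch below cleans a string and finds the closest already-coded string by Levenshtein distance. It is illustrative rather than a reproduction of the project's programs: it assumes that spaces are removed along with other non A to Z characters, the distance threshold is arbitrary, and the codes shown are placeholders, not real OCCODE values.

```python
import re


def clean(raw):
    """Upper-case a string and drop every character outside A to Z."""
    return re.sub(r"[^A-Z]", "", raw.upper())


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_match(raw, coded, max_distance=2):
    """Return the code of the closest coded string, or None if too distant."""
    target = clean(raw)
    best_code, best_dist = None, max_distance + 1
    for known, code in coded.items():
        dist = levenshtein(target, clean(known))
        if dist < best_dist:
            best_code, best_dist = code, dist
    return best_code


# Placeholder dictionary of already-coded strings (codes are hypothetical)
coded = {"Bricklayer": "CODE_A", "Blacksmith": "CODE_B"}
```

With this dictionary, `best_match("Brick-Layr", coded)` returns `"CODE_A"`, while a string with no close neighbour returns `None` and would be left for a later stage.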
Whilst this brought some success, due to the complexity and variance of the strings an additional approach also needed to be taken. Each string was broken down into composite "words" and a thesaurus of all "words" associated with each OCCODE was created; words within each code were also weighted to indicate the "importance" of the word to that particular code, to other codes, and to the sequence or order in which the words occurred within the given strings. In this way the words of the uncoded strings were compared to the words of the coded strings, both like for like and comparing words using n-gram matching, in order to predict the most likely code for the given string.
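The word-level approach can be sketched in a much simplified form as follows. The weighting used here (the share of a word's occurrences that fall under each code) is an assumption chosen for illustration; the actual weights, the treatment of word order, and the n-gram comparison are not reproduced, and the code labels are hypothetical.

```python
from collections import Counter, defaultdict


def build_thesaurus(coded_strings):
    """Map each word to {code: weight}, where weight is the share of the
    word's occurrences that fall under that code."""
    per_word = defaultdict(Counter)
    for text, code in coded_strings:
        for word in text.lower().split():
            per_word[word][code] += 1
    thesaurus = {}
    for word, by_code in per_word.items():
        total = sum(by_code.values())
        thesaurus[word] = {code: n / total for code, n in by_code.items()}
    return thesaurus


def predict_code(text, thesaurus):
    """Return the code with the highest summed word weight, or None."""
    scores = Counter()
    for word in text.lower().split():
        for code, weight in thesaurus.get(word, {}).items():
            scores[code] += weight
    if not scores:
        return None
    return max(scores.items(), key=lambda item: item[1])[0]


# Invented coded strings with hypothetical code labels
coded_strings = [
    ("coal miner", "MINING"),
    ("coal hewer", "MINING"),
    ("coal merchant", "DEALING"),
    ("wine merchant", "DEALING"),
]
thesaurus = build_thesaurus(coded_strings)
```

Here "coal" alone is ambiguous between the two codes, but "coal miner underground" scores highest for the first and "retired coal merchant" for the second, because "miner" and "merchant" each occur under one code only.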
The introduction of the 1921 data in the most recent edition of I-CeM presented a further complication. The number of unique strings present for the 1921 data exceeded 5 million for a variety of reasons not limited to the simple size of the box on the schedule for this question, encouraging householders to elaborate / pontificate. Additionally, while the recently re-organised GRO continued to use Hollerith's coding system to ease tabulation of data, the logic under which occupations were classified was substantially altered (see Census History 1921). The 1921 system distinguished between individuals primarily based upon their role rather than their industry, such as placing all retail business owners in a single code whilst foremen, owners, workers, clerks, or packers in various industries were separately defined. The two systems are not, however, incomparable. Given the fundamental differences between the construct of the 1921 occupational classification system and those for all previous censuses, it was decided to place emphasis initially on the occupation Hollerith codes which had been transcribed alongside the occupation strings since these should allow users to undertake analyses of employment in line with the classification system devised by the Registrar General's Office in conjunction with the 1921 census.
Examining the occupation Hollerith codes for 1921 (England and Wales) showed that whilst many transcribed codes did not match the list provided in the RG Report, this affected only c. 1.5% of strings. The vast majority of these were assigned to three codes which were valid at the time of schedule markup but were later disposed of before publication of the report (e.g. 048 Colliery Checkweighmen). However, this still suggested a not insignificant error rate among transcribers, indicating that some valid codes had also been assigned to occupations incorrectly.
The original transcription was preserved but an additional field called HOLLERNOS was devised for 1921 to provide an accurate 'likely' Hollerith code. This is different from the RG report not least because it cautiously places uncertain strings in 989 (All Other Occupations). Common strings coded to invalid codes were compared with the most frequently occurring combination of that string with a valid code (e.g. Bricklayers coded to 375 were recoded to 575). However, the subtle variation in codes that reflects the 1921 system's focus on rank within trades (such as Bricklayer's Labourer 576) was preserved by maintaining codes that differ from the expected code only in the final digit (e.g. where a Bricklayer was coded as 576). Rare strings (e.g. those occurring fewer than 50 times across the census) were matched to common string-code pairs using approximate character matching techniques of increasing leniency in similarity. Where multiple matches were made (which occurred in almost all string searches) the valid code match closest to the transcribed code was selected, under the assumption that the transcriber was most likely to make an error in only one digit (e.g. mistaking 987 for 907 rather than 575) or to place numbers in the wrong order (e.g. 757 instead of 575), followed by the strength of string similarity, and finally (if necessary) the frequency at which a pair occurred. Any strings occurring more than once and still unmatched (all possible pairs having been discounted due to low similarity) were manually coded. This dictionary of "valid" coding was used to create a thesaurus of terms connected to each Hollerith code, subsequently enabling an algorithmic verification and (where necessary) recoding of the remaining corpus of occupations.
Having created the HOLLERNOS variable for the census of 1921, this was then used as a bridge to link the related occupational strings to OCCODE, which, in turn, allows the 1921 census data to be analysed in a comparable way to earlier censuses.
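The order of preference among multiple candidate matches (closeness to the transcribed code, then string similarity, then frequency) can be expressed as a sort key. The sketch below is an assumption-laden illustration: how 'closeness' between codes was actually measured is not specified, so a simple three-level distance is used in which a single differing digit or a re-ordering of the same digits counts as close. The candidate lists, similarity scores, and frequencies are invented.

```python
def code_distance(transcribed, candidate):
    """0 = identical; 1 = one digit differs or same digits re-ordered; 2 = otherwise."""
    if transcribed == candidate:
        return 0
    if len(transcribed) == len(candidate):
        differing = sum(1 for a, b in zip(transcribed, candidate) if a != b)
        if differing == 1 or sorted(transcribed) == sorted(candidate):
            return 1
    return 2


def pick_code(transcribed, candidates):
    """Choose among (code, similarity, frequency) candidates.

    Candidates are ranked by closeness to the transcribed code first,
    then by string similarity, then by how often the pair occurs.
    """
    ranked = sorted(
        candidates,
        key=lambda c: (code_distance(transcribed, c[0]), -c[1], -c[2]),
    )
    return ranked[0][0] if ranked else None


# Invented candidates: (valid code, string similarity 0-1, pair frequency)
choice = pick_code("907", [("987", 0.90, 10), ("575", 0.95, 500)])
```

In this example `choice` is `"987"`: although the other candidate has a higher string similarity and frequency, it differs from the transcribed code in every digit.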
Birthplaces
Birthplaces presented a new and different set of problems and challenges. Birthplace in the CEBs was usually recorded at three levels – parish of birth, county of birth, and country of birth (mainly for those living outside of their country of birth) – translating to the three variables BPCMTY, BPCNTY and BPCTRY. Whilst the data for 1851-1901 had been transcribed according to these three levels, the order in which the information is recorded in the enumerators' books need not necessarily conform to these three levels. Thus, parishes may be recorded in the county or country fields, and vice versa. The schedules of 1911 and 1921 – containing the personal response of the householder – further increase the confused order in which birthplaces are described and, in 1921 in particular, responses might be absurdly vague or precise (specifying street addresses or even a room within a property). A further complication with 1921 is that the information was transcribed as two rather than three fields (BPCMTY and BPCNTY, which were often intertwined) together with Nationality, a question which had been introduced earlier in 1911.
Birthplace was coded according to geographical hierarchy – country first, then county, then parish. Having assigned a county code (BPCNTI) to each birthplace string (combining the three birthplace variables), parishes could then be coded within county groups. This was supported by the creation of a parish-level dictionary covering each (ancient) county in England, Wales and Scotland. In creating this dictionary, it was important to include not only all civil parishes, but also a sub-parish level field linking to the parent parish, as it is clear that the information recorded in the censuses does not always relate to a civil parish. So, for example, the parish of Hatfield Broad Oak in Essex contains two distinct hamlets, Bush End and Hatfield Heath. Whilst not technically parishes, these designations may (and do) occur in the census returns as they were how people thought of the places in which they lived and therefore need to be associated with the parent parish. This issue of sub-parish level information is particularly relevant in a number of northern counties where the ancient civil parish may cover a wider geographical area with several distinct settlements or townships. In order to address this a gazetteer or dictionary of parishes with associated place names was constructed from a variety of sources, including the 1911 census report, which lists a large number of sub-parish settlements and relates them to civil parishes, the 1971 OPCS Gazetteer of Place Names, the Ordnance Survey Gazetteer and GB1901.
Two additional problems also needed to be taken into consideration and incorporated into the dictionaries. The first of these is name variation and change. To continue the example of Hatfield Broad Oak, this was historically known also as Hatfield Regis due to the royal forest which historically made up a large part of the parish. Linked to this is the related issue of name standardisation, which is a particular feature for Welsh parishes in the nineteenth century, and to a lesser extent Scottish parishes as well.
The second problem relates to agglomerations of parishes, especially in the case of urban areas. So, to take another Essex example, prior to its formal creation in the early 20th century, the town of Southend-on-Sea was not a single parish but rather an amalgam of four civil parishes: Prittlewell (where it had its origin in the 'south end' of the parish), Leigh, Southchurch, and Eastwood. Obviously the same is true of many urban conurbations throughout England, Wales and Scotland, and the dictionary had to include these 'places' as well as their composite parishes. Once country and county had been appropriately allocated, then a similar approach to the coding of occupations was applied to birthplaces – strings were compared to those in the compiled dictionary, initially as strings and then as word combinations, in order to predict the most likely candidate parish.
When a parish name appeared to be spelt correctly and was a "proper" parish name but was listed in the wrong county – e.g. Colchester, Suffolk – then parish trumped county. This example would be standardised as Colchester with the county code (BPCNTI) of Essex. If a 'correct' parish or place name occurs in more than one county then the nearest (using centroid distances) option was selected and the variables assigned as in the example above.
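The 'parish trumps county' rule might be sketched as below. Several things here are assumptions made purely for illustration: that 'nearest' is measured from the centroid of the county the respondent stated, that centroids are held as simple planar coordinates, and the gazetteer structure itself. The coordinates are arbitrary grid values, not real locations, and 'Newtown', 'Northshire', 'Southshire' and 'Midshire' are invented names.

```python
import math


def resolve_parish(parish, stated_county, gazetteer, county_centroids):
    """Return (parish, county) for a birthplace, letting parish trump county."""
    counties = gazetteer.get(parish, {})
    if not counties:
        return None  # not a recognised parish or place name
    if stated_county in counties:
        return parish, stated_county  # stated combination is valid as it stands
    if len(counties) == 1:
        return parish, next(iter(counties))  # only one county it can be
    origin = county_centroids.get(stated_county)
    if origin is None:
        return None  # nothing to measure distance from
    nearest = min(counties, key=lambda c: math.dist(origin, counties[c]))
    return parish, nearest


# Hypothetical gazetteer: parish -> {county: parish centroid (x, y)}
gazetteer = {
    "Colchester": {"Essex": (10.0, 5.0)},
    "Newtown": {"Northshire": (0.0, 50.0), "Southshire": (0.0, -10.0)},
}
# Hypothetical county centroids
county_centroids = {"Suffolk": (10.0, 12.0), "Midshire": (0.0, 0.0)}
```

Under these assumptions `resolve_parish("Colchester", "Suffolk", gazetteer, county_centroids)` yields `("Colchester", "Essex")`, matching the example in the text.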
An additional difficulty in standardising birthplace data was identified for the 1921 England and Wales census data: during transcription, the derived variables for the standardised places had been created uncritically. Key terms had been over-emphasised in matching, without verifying their relationship to the place of enumeration or even to terms related to county within the string. As such, all places containing the term 'Stratford' had been placed in London including those born in Stratford on Avon. Similarly, those born in Great Yarmouth had been assigned to Yarmouth on the Isle of Wight. Birthplaces thus had to be re-parsed and, for almost 100,000 birthplace strings, manually reassigned.
Standardisation Summary
It is important to realise that whilst every effort has been made to ensure consistency across all standardisation, the coding is not and cannot be 100 per cent accurate. Mistakes will undoubtedly have been made. In part this is due to the fact that by its very nature, coding is a subjective exercise. Decisions over how an ambiguous string should be classified will vary from person to person. In addition, for straightforward practical reasons, all strings largely had to be coded 'blind'. That is to say, strings were generally coded in the aggregate as strings, absent from any contextual information about the individual taken from the rest of their census record. Whilst this may not be important in the majority of fields, or cases, it may have significance in the case of occupations and relationships. With the former, because any one dictionary entry can only have one code (in the case of occupations, 1851 to 1911), an occupation title which could have more than one meaning will only default to a single code. In the case of relationships, a simple string such as 'son' might be nuanced by the familial situation in which it is recorded, perhaps in reality being a stepson or a grandson if there is an intervening generation. This problem, however, was addressed by a series of programs which classified households by taking all individuals within the household into consideration and re-assigning relationship designations if appropriate (see section on the enrichment process below).
Because of this ambiguity around standardisation and coding, it is important to realise that when codes were added or data were 'altered' during this process, the original strings from which the codes were derived were preserved as separate variables within the database. Thus users can recode or reclassify should they wish to do so.
To our knowledge, the coding dictionaries created as part of this standardisation exercise represent the most complete and comprehensive classification performed on historic census data.
Reformatting
Whilst similar to standardisation in that it produces a new variable which adds value to the original variable from which it is derived, reformatting in the context of this project differs in that it is not classifying the original variable in a coded form. The key variable to be reformatted was age whereby character ages were transformed into a numeric equivalent (as far as possible). Thus an original age of '6 months' was transformed into 0.5, '18 months' into 1.5, and so on. Other variables which were reformatted included address, schedule and the archival census reference.
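A reformatting step of this kind could look like the following sketch, which reproduces the two examples given above. The function name is hypothetical, and the handling of weeks and days is an assumption added for illustration; the text only describes months.

```python
import re

# Units per year; weeks and days are assumptions, only months appear in the text
_PER_YEAR = {"year": 1, "month": 12, "week": 52, "day": 365}


def parse_age(raw):
    """Convert a transcribed age string to years, or None if unparseable."""
    text = raw.strip().lower()
    if re.fullmatch(r"\d+", text):
        return float(text)
    match = re.fullmatch(r"(\d+)\s*(year|month|week|day)s?", text)
    if match is None:
        return None  # leave the original string for manual inspection
    return int(match.group(1)) / _PER_YEAR[match.group(2)]
```

For example, `parse_age("6 months")` gives `0.5` and `parse_age("18 months")` gives `1.5`, while a string that fits neither pattern yields `None` rather than a guessed value.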
Consistency checking
Consistency checking is possibly more contentious because the program automatically alters the coded data if the rules that have been created are implemented. However, as stated above, this process is tracked by the allocation of an 'inference' variable and therefore can be 'undone' should it be deemed desirable by others. One of these consistency checks, for example, evaluates the relationship between the three variables: sex, relationship, and first name. It is important to first note that, prior to 1921, on the original schedules there were two columns for age; one for age of males, the other for age of females (from 1921, a specific sex field was included in the schedule). Quite frequently the enumerator when completing his schedule put the age of the person (inadvertently) into the first of these two columns regardless of the actual gender of the person; this leads to a greater number of recorded males than females. The program checks that each individual has consistently gendered variables, i.e., someone whose first name is female and whose relationship is feminine should also be recorded as female. A dictionary of first names which has been checked manually has been constructed to test whether first names are masculine or feminine (initials are excluded), and this, together with the relationship code, which is often gender specific (e.g. wife can only be female), is used to test the entry in the sex variable. If both first name and relationship indicate a different gender to that in the sex variable, the sex variable is altered. If there is still a problem, i.e., the relationship is not gender specific (e.g. head), then the sex is not altered.
The main consistency checks and alterations can be summarized as follows:
- If marital status is unknown and age is less than 26 then marital status is set at unmarried.
- If marital status is unknown and relationship is one of child to the head of household then marital status is set at unmarried (this excludes in-laws).
- If marital status is unknown or single and relationship is 'wife' then marital status is set to married.
- If marital status is married and age is less than or equal to 15 then marital status is set to unmarried.
- For those people where the relationship to head of household is gender-specific, alter the sex if relationship and first name refer to a different sex than the sex variable.
- For those people where the relationship to head of household is gender-specific, alter the sex if relationship and first name refer to the same sex and the sex variable is unrecorded.
- Unmarried 'in-laws', if 17 or under, are reclassified as step-children.
- If relationship is a generation above that of the head and age is less than or equal to 15 then the age is set to missing.
- If relationship is two or more generations down from the head and age is greater than or equal to 55 then the age is set to missing.
- If relationship is two generations above that of the head and age is less than 28 then the age is set as missing.
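A few of the rules listed above can be expressed as a small function, shown here for the four marital status rules only. This is a sketch under stated assumptions: the field names and category labels are hypothetical, 'single' and 'unmarried' are treated as the same category, the rules are applied in the order listed, and the 'inferred' field merely stands in for the inference flag.

```python
def apply_marital_rules(record):
    """Apply the marital-status rules to a copy of one person record."""
    rec = dict(record)  # never alter the caller's record
    original = rec.get("marital_status")
    age = rec.get("age")
    relationship = rec.get("relationship")
    status = original

    # Unknown status and under 26: set to unmarried
    if status == "unknown" and age is not None and age < 26:
        status = "unmarried"
    # Unknown status and a child of the head (in-laws excluded): unmarried
    if status == "unknown" and relationship in {"son", "daughter"}:
        status = "unmarried"
    # Unknown or single, but recorded as 'wife': married
    if status in {"unknown", "unmarried"} and relationship == "wife":
        status = "married"
    # Married but aged 15 or under: unmarried
    if status == "married" and age is not None and age <= 15:
        status = "unmarried"

    rec["marital_status"] = status
    rec["inferred"] = status != original  # flag that a rule changed the value
    return rec
```

Returning a copy with a flag, rather than overwriting the record, mirrors the principle that alterations should be traceable and reversible.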
Inference variable or flag
As mentioned above, the enrichment process created an 'inference' variable which flags particular records where the computer programs have altered the coded value of the original string entry because of certain rules which have been operationalized within the given program. With the current release of the data, these flags only show that some alteration has been made. With the possible exception of the data relating to head of household, it should usually be fairly obvious how and why this information was changed.
The enrichment process and households
Households are a critical element in any social or economic research conducted using census data. However, despite the importance of identifying households, in the case of the Victorian and Edwardian censuses the task is not always straightforward. The problem is principally caused by two interrelated factors. First, the definitions issued by the General Register Offices in Edinburgh and London as to what constituted a household were ambiguous. Second, both enumerators and householders alike could interpret these definitions differently.
In England and Wales in 1911, for example, the instructions to enumerators gave the following direction regarding persons who were to receive separate household schedules:
- Every Head of a Family occupying the whole or part of a house or flat.
- Every separate Lodger occupying a room or rooms in a house or flat and not boarding with a family in the house (when two or more lodgers share a room or rooms they must be treated for Census purposes as a "Family").
- Every Resident Caretaker of a House to be let, of a Shop or of other Business premises, or of a public building.
- Every Outdoor Servant (with or without family) occupying separately any building or rooms in a building, such as a Lodge, Gardener's Cottage, Dwelling Rooms over a Coachhouse or Stable &c., which is detached from the house to which it belongs or has no internal communication therewith.
- Every Resident Proprietor, Manager or Head of a Hotel, Club, Business Establishment, School, &c., unless the Registrar has notified you that he has appointed such person to act as the Enumerator of the establishment.
These precepts were the outcome of a lengthy process of evolution over the course of the previous 60 years which subtly altered what a census 'household' meant (see Higgs, Making Sense of the Census Revisited, 72-4).
The first of these instructions suggests that the conjugal or biological family was central to the London GRO's notion of what constituted a 'normal' household, but the clarity of this definition is confused by the situation of lodgers and the attempt to distinguish between those who were integrated with the main family through the sharing of meals (or 'boarding') and those who formed an independent social and or economic unit. This distinction was largely lost on some enumerators and householders, who did not find it possible or desirable to define households in such a way. It should also be remembered that the definition of the house in Scotland was rather different to that south of the border. In short, the theoretical definitions concerning households issued by the GRO did not automatically translate themselves into workable practical definitions as perceived on the doorsteps of Victorian or Edwardian Britain.
In order to overcome these problems and enforce a measure of consistency with regard to the definition of households within the data, a number of complex consistency checks and corrections were undertaken automatically. In attempting to address this problem of inconsistency Michael Anderson has recommended that researchers ignore the allocation of schedules altogether and concentrate on the column containing information on the relationship to household head, treating as a household all those individuals listed between one head and the next.
Anderson's simple rule regarding household formation[1] has largely been followed in the I-CeM enrichment program, with the following key changes:
- If two heads are recorded within the same 'original' household and the second of the heads shares the same surname as the first head and the address for both 'heads' is the same then the relationships of the second group are changed as appropriate to form one single household. Otherwise, the second group is split from the first to form two distinct households.
- If a 'specific' property (e.g. 69 Windsor Street, as opposed to Windsor Street) has been allocated two schedules, or the enumerator explicitly specifies a sub-schedule (e.g. 5A or 5 1/2) then the individuals in question are treated as lodgers or familial lodgers.
- If an 'original' household has no head and the group consists entirely of servants and/or visitors and/or lodgers and/or boarders, and the address is the same as the previous household then it is joined with the previous household and relationships changed as appropriate, to ensure consistency. Otherwise the first person of the 'original' household is made 'head' and subsequent relationships changed accordingly.
In joining households, the new household always takes the household identifier (HID) of the first household in the group being joined. If an 'original' household is split to form two or more new households, the new household(s) split off from the original are allocated a new HID value.
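The head-based splitting and HID allocation described above can be illustrated with a minimal sketch. It is not the I-CeM enrichment program: the `Person` record, the function names, the integer HIDs and the exact string comparison of surnames and addresses are all assumptions made for illustration, and only the first of the three modifications (two heads sharing surname and address) is modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Person:
    # Hypothetical record layout; not the I-CeM variable names.
    surname: str
    address: str
    relationship: str  # as transcribed, e.g. "Head", "Wife", "Lodger"


def is_head(person: Person) -> bool:
    return person.relationship.strip().lower() == "head"


def split_original_household(members: List[Person]) -> List[List[Person]]:
    """Split one 'original' household at each further head, unless that
    head shares surname and address with the head of the current group."""
    groups: List[List[Person]] = []
    current_head = None
    for person in members:
        starts_new = not groups  # the first person always opens a group
        if is_head(person):
            if current_head is None:
                current_head = person
            elif (person.surname, person.address) != (
                current_head.surname, current_head.address
            ):
                starts_new = True
                current_head = person
            # Otherwise: same surname and address, so the groups stay joined.
            # Rewriting the second group's relationships is not modelled here.
        if starts_new:
            groups.append([])
        groups[-1].append(person)
    return groups


def assign_hids(groups, original_hid: int, next_free_hid: int) -> List[int]:
    """The first group keeps the original HID; split-off groups get new ones.
    Integer HIDs and the next_free_hid counter are assumptions."""
    if not groups:
        return []
    return [original_hid] + [next_free_hid + i for i in range(len(groups) - 1)]
```

In this sketch a second 'Smith' head at the same address stays in the first group, while a 'Jones' head at that address opens a second household with a newly allocated HID.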
The enrichment process also sought to identify what has been termed 'shifting headship'. This occurs when relationships within a given household are defined in relation to an individual other than the head, rather than to the head of the household. In such cases the relationship codes for those with 'shifted' relationships are changed as appropriate. An example is as follows:
| Original entry | 'True' relationships |
| --- | --- |
| Head | Head |
| Wife | Wife |
| Son | Son |
| Son | Son |
| Wife | Daughter-in-law |
| Daughter | Grand-daughter |
| Lodger | Lodger |
| Daughter | Lodger's daughter |
The enrichment process and institutions
The problem of the definition of institutional records is important for the correct analysis of households. Essentially an institution can be defined based on living arrangements, as an establishment whose residents, other than those described as visitors, are normally catered for communally rather than cooking for themselves. These institutional residents are not considered to be attached to any household and, ordinarily, are only included in the total counts of population.
Thus the main problem is to correctly identify institutions and those who are resident within those institutions. A further task, which is not quite so important, is to correctly identify the relationship between the people within institutions.
Despite the carefully worded definition given in the opening paragraph of this section, there are still problems with defining institutions. The GRO defined a number of establishments to be listed by local Registrars; the 20 listed in 1921 included Workhouses, Schools, Asylums, Nursing Homes, Hospitals, and Barracks. Furthermore, all those establishments which Registrars presumed were likely to contain at least 100 individuals on census night and 'every Registered Lodging House' were to be given institutional schedules.
The same instructions specified who might act as Enumerator in these cases. In 1911 it was ordered that in the case of (a), "an Institution containing 100 or more inmates[,] the Chief Resident Officer must act as the Enumerator" and in the case of (b), large establishments containing 100 or more inmates, the Registrar could appoint the Enumerator at their own discretion.
On this basis the distinctive characteristic of an 'institution' is that it has 100 or more people and is enumerated by the Chief Resident Officer, or head, rather than a 'regular' enumerator. However, in the censuses of 1851 to 1881 in England and Wales the threshold had been set at 200 persons, and in Scotland in 1871 it had been 135. There were thus two definitions of institutions: one based on living arrangements, the other an official one in which the size of the 'institution' was the main basis of definition, although the threshold for inclusion might vary.
The identification of institutions is further complicated by the fact that in some instances (especially in the records for Scotland in 1901) the transcript records neither a schedule number nor a written description of the institution. Thus, in some cases, it is not entirely clear where one institution ends and another starts. In such cases, a pragmatic decision has been made based on information in the relationship to head column.
It was decided that a definitive solution for these households was necessary but likely to be impossible without checking each 'institution' by hand and making decisions based on the household structure. It was decided, therefore, that it would be better to compromise, and attempt to resolve as many problems as possible rather than aim for perfection. It is probable that the solution adopted has created some errors, but these are more than outweighed by the resolution of other errors and by the fact that the rules which inform both corrections and errors were applied consistently. The solution was based on four pieces of information: first, the type of census enumeration book used (where available); second, the address given in the enumerators' books (again, where available); third, the ratio of kin to non-kin within the household; and fourth, the size of the 'household' or co-residential group. The combination of these variables can be summarized as follows:
- If the household is not already defined as an institution and has more than 20 residents, of whom 10 or more have a relationship to the head of household which is 'miscellaneous' (i.e. not kin, not inmates, not boarders, not lodgers, not servants and not visitors), then the household shall be defined as an institution.
- If the household is not already defined as an institution, has more than 20 residents, and more than two-thirds of them are institutional inmates, household inmates, family inmates or servants, then the household shall be defined as an institution.
- If the household is not already defined as an institution, there is a valid 'institution-word' in the address, there are more than six residents and the ratio of kin to non-kin is greater than 0.8, then the household is redefined as an institution.
- If the household is defined as an institution, there is no valid institution-word in the address, there are fewer than 24 residents and the ratio of kin to non-kin is less than 0.8, then the household is redefined as an ordinary household.
- If the household is defined as an ordinary household and there is a valid 'vessel word' in the address then the household is redefined as a vessel.
These alterations are reflected in the DOCTYPE variable. For example, the first three of these types would result in the variable DOCTYPE being set to value 4, which indicates that it would have been value 1 had not the enrichment program altered it. (For description of variable values, see the relevant entry in the Metadata section of this website).
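The five rules listed above can be expressed as a short decision function. This is a sketch under stated assumptions rather than the enrichment program itself: the `Household` fields, the string labels used in place of DOCTYPE codes, and the choice to apply the rules in the listed order with the first match winning are all hypothetical. The thresholds and the direction of each comparison are copied from the rules exactly as they are worded above.

```python
from dataclasses import dataclass


@dataclass
class Household:
    # Hypothetical summary of one co-residential group.
    doc_type: str               # "household", "institution" or "vessel"
    residents: int
    miscellaneous: int          # not kin, inmate, boarder, lodger, servant, visitor
    inmates_or_servants: int    # institutional/household/family inmates, servants
    kin_to_non_kin: float
    has_institution_word: bool
    has_vessel_word: bool


def reclassify(h: Household) -> str:
    """Apply the five rules in listed order; the first match wins (assumption)."""
    if h.doc_type != "institution":
        # Rule 1: large group with many 'miscellaneous' relationships.
        if h.residents > 20 and h.miscellaneous >= 10:
            return "institution"
        # Rule 2: large group, more than two-thirds inmates or servants.
        # Integer arithmetic avoids floating-point rounding at the boundary.
        if h.residents > 20 and 3 * h.inmates_or_servants > 2 * h.residents:
            return "institution"
        # Rule 3: institution-word in the address (comparison as worded above).
        if h.has_institution_word and h.residents > 6 and h.kin_to_non_kin > 0.8:
            return "institution"
    else:
        # Rule 4: small 'institution' with no institution-word.
        if (not h.has_institution_word and h.residents < 24
                and h.kin_to_non_kin < 0.8):
            return "household"
    # Rule 5: an ordinary household with a vessel word in the address.
    if h.doc_type == "household" and h.has_vessel_word:
        return "vessel"
    return h.doc_type
```

Returning a label rather than a DOCTYPE code keeps the sketch independent of the coding scheme, which is documented in the Metadata section.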
Institution-words, mentioned above, are one of the means of discerning whether a household is an institution or not. These words generally assist in defining institutions with between 6 and 20 residents, i.e., households which will not be picked up by the first two rules above. The list of 'institution-words' which the enrichment process used was based on a careful examination of all the different types of institution used in the published census reports, and their corruptions. For example, 'Wkhuse' is an acceptable 'institution-word'. Added to this list are a variety of other terms which were not used in the published reports but would clearly fall into the definition of an institution based on living arrangements, e.g., hotel, tavern, and so forth.
Obviously this will cause anomalous results: a lodging-house keeper with five travelling salesmen as residents will not be defined as an institution, while one with six will. The cut-off point is arbitrary, but it is felt that this should not distort results grievously.
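The institution-word check described above could be sketched as a simple token lookup against the address string. The word list here is a small illustrative subset built from the examples mentioned in this section; the actual list used by the enrichment process is much longer and is not reproduced here, and the function name and tokenisation are assumptions.

```python
import re

# Illustrative subset only; includes one transcription corruption ('wkhuse').
INSTITUTION_WORDS = {"workhouse", "wkhuse", "hotel", "tavern"}


def has_institution_word(address: str, words=INSTITUTION_WORDS) -> bool:
    """True if any alphabetic token in the address is a listed word."""
    tokens = re.findall(r"[a-z]+", address.lower())
    return any(token in words for token in tokens)
```

Matching whole tokens rather than substrings avoids false positives from longer words that merely contain a listed term, at the cost of missing plurals and unlisted corruptions.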
The other anomaly which needs mentioning relates to very large private households. The household of a nobleman may contain himself and his wife, and a retinue of servants. Under the first rule stated above, this private household might become an institution. This is problematic, as it will disguise the number of very large households. However, careful examination of the sample data suggests that more households are being correctly defined as institutions than are being incorrectly so defined.
In the case of the 1921 census for England and Wales, institutions were assessed more thoroughly. As enumeration was delayed by strikes into late June, thousands of people were already enjoying their summer break, so the number of lodging-institutions in particular was likely to be higher, as was the number of people visiting private households (particularly amongst the nobility). Prime Minister David Lloyd George, on holiday at Chequers with a large number of guests, was himself removed from the list of probable institutional inmates. To better distinguish between "true" institutions and others, including lodging-institutions, a code for institutions was created for 1921 data only.
[1] M. Anderson, 'Standard tabulation procedures for the census enumerators' books, 1851–91', in E. A. Wrigley (ed.), Nineteenth-century society: essays in the use of quantitative methods for the study of social data (Cambridge, 1972), 134–45.
The Cambridge Group for the History of Population and Social Structure