Data science is a new name for a very old practice: large sets of data have been gathered and analyzed throughout time. The taking of a census—the procedure of systematically collecting information about a specified group of people—extends back into antiquity; centuries ago, census officials traveled the breadth of the Roman Empire every 5 years in order to count its population.
For the United States, the census played a significant role in the country’s founding ideologies. Grievances over a lack of colonial representation in the British Parliament had spurred the American Revolution, and afterwards, delegates from the recently independent States were determined that in their new government, power would be distributed more fairly among the people. Two of the Revolution’s most prominent issues, taxation and representation, would now be assigned according to a simple population count.
And so, the U.S. Census came to be.
Representatives and direct Taxes shall be apportioned among the several States which may be included within this Union, according to their respective Numbers, which shall be determined by adding to the whole Number of free Persons, including those bound to Service for a Term of Years, and excluding Indians not taxed, three fifths of all other Persons. The actual Enumeration shall be made within three Years after the first Meeting of the Congress of the United States, and within every subsequent Term of ten Years, in such Manner as they shall by Law direct...
U.S. Constitution, Article I, Section II
And the Founding Fathers knew about the perils of stale data. In 1787, at the time of the Constitutional Convention, the United States was a cluster of regions along the Atlantic coast, but the country was quickly growing, with the western frontier line moving steadily deeper into tribal lands. So every ten years, the census would be taken again, and the number of delegates representing each state in the House of Representatives would be adjusted to compensate for any shifts in the U.S. population.
As the country changed, so did its census. The questions the census asks, how it’s administered, where the data ends up—all of these have undergone dramatic modifications since 1790. The various iterations of the census reflect transformations in the relationship between the government and its citizenry, in ideas about race and identity, in what makes an individual an American—and all the ways a person can count.
1790: The First U.S. Census
One year after George Washington became president, Congress dispatched U.S. Marshals to visit every household in the 13 states, as well as a few areas that had yet to become states: the districts of Kentucky, Maine, Vermont, and the Southwest Territory, which later became Tennessee. The Marshals knocked on doors throughout the country and spoke to the head of each household—each “master, mistress, steward, overseer, or other principal person therein.” In addition to the name of the head of household, the Marshals also recorded the number of people in each household in the following categories:
Free white males aged 16 and older
Free white males under the age of 16
Free white females
All other free persons
Slaves
And just as the Constitution had directed, numbers of delegates in the House of Representatives were assigned in proportion to each state’s population, calculated as the sum of free constituents and three-fifths of slaves. The majority of Americans were not eligible to vote for any of these delegates, but they were all, in some way, represented. By this method, the Founding Fathers could convince themselves that theirs was truly a government “of the people.”
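The arithmetic behind that calculation can be written out as a short sketch. The function name is hypothetical, and the inputs are the rounded national figures for 1790 quoted later in this article (3,929,214 inhabitants, about 700,000 of them enslaved), not an official state-by-state apportionment base. The sketch also ignores the clause excluding “Indians not taxed.”

```python
from fractions import Fraction

# Ratio fixed by the apportionment clause of Article I.
THREE_FIFTHS = Fraction(3, 5)


def apportionment_population(free_persons: int, enslaved_persons: int) -> Fraction:
    """Return the population basis used for apportionment.

    Hypothetical helper for illustration only: free persons count in full,
    enslaved persons count as three-fifths.
    """
    return free_persons + THREE_FIFTHS * enslaved_persons


# Rounded national figures from the 1790 count, as quoted in this article.
total_counted = 3_929_214
enslaved = 700_000  # approximate
free = total_counted - enslaved

basis = apportionment_population(free, enslaved)
print(f"Counted: {total_counted:,}  Apportionment basis: {int(basis):,}")
```

With these rounded inputs, the basis comes to 3,649,214, which is 280,000 fewer than the number of people actually counted.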
1830: Standardizing the Form
For the first four censuses, Marshals were not provided with forms to fill out; they were simply given instructions regarding which questions to ask. They drew up their own forms on whatever paper was available to them, which resulted in some problems when the time came to calculate the final results. The collected figures were on paper of varying dimensions and thicknesses; a stack of records was difficult to manipulate and archive. In 1830, for the fifth United States Census, the Marshals were provided with uniform printed questionnaires.
The country’s population had tripled since the first census, and so had the number of Marshals tasked with enumerating it. The questions had also expanded: the census now recorded numbers of unnaturalized foreigners; whether people were engaged in agriculture, commerce, or manufacture; and whether they were blind or deaf. It was clear, by now, that census data would have applications beyond the apportioning of taxes and representatives. The data provided information about immigration, economic trends, and public health, information that would guide legislation and policy.
1850: Scientific Racism
As the industrial revolution began to take root in America, this census collected much more information than ever before. In addition to having separate forms for free people and slaves, the Marshals were given four more sets of questions that would provide insights into social and economic demographics. In total, there were six census schedules:
Free inhabitants: the name of every free person, how much real estate they owned, whether they had been married within the year, and whether they were illiterate
Slave inhabitants: the name of the slaveowner, and whether each slave was “deaf and dumb, blind, insane, or idiotic”
Productions of agriculture: for each farm, the name of the landowner, how many horses, sheep, or swine it held, and how much wheat, rye, or oats it had produced
Products of industry: how much raw material was used, and the number of hands employed
Social statistics: for each town, county, or city, how many public schools and libraries there were, and how many periodicals, public paupers, criminals, and churches
Persons who died: whether free or slave, married or widowed, and the cause of death
Previously, census questions had only differentiated race as white or non-white (“colored” was used from 1800 to 1840). Essentially, this meant recording whether a person was white or black; it’s unclear how American Indians were recorded. On the 1850 census, however, there was a new racial category: mulatto, a person with both white and black ancestry.
In the mid-19th century, while Charles Darwin was still writing On the Origin of Species, another theory about the human species had achieved mainstream acceptance in Europe and the United States: polygenism. This was the idea that different races were actually different species, and American slaveowners loved it. Polygenism became a scientific defense for the institution of slavery: black people were inherently different from—and inferior to—white people, they argued. Prominent American scientists supported and promoted polygenism, including naturalist Charles Pickering, geologist Louis Agassiz, and surgeon Josiah Nott. In fact, Nott was a slaveowner himself, and he had special concerns about his country’s multiracial population.
If black and white people were different species, then multiracial people were a hybrid species with their own particular characteristics: namely, Nott claimed, short lifespan, infertility, physical weakness, and social degeneracy. He also believed that multiracial people were more intelligent than their “pure” black ancestors, and that they tended to put this intelligence to use by leading slave rebellions. For Nott, the mulatto was a potential threat to the health of American society, and should be tracked. It was through Josiah Nott’s recommendation that “mulatto” became a racial category on the U.S. Census. Counting people was no longer just about endowing them with proportional political representation. Now, counting people was also about maintaining a level of control over them.
1870: Expanding Categories
Very soon, the census added more racial categories. By 1870, the country had emerged from the American Civil War, and the Emancipation Proclamation had been followed by the Thirteenth Amendment, which abolished slavery. The separate “free inhabitants” and “slave inhabitants” questionnaires were no more.
And the census had begun to count two more groups: American Indians, starting in 1860, and Chinese, in 1870. These people were labeled “Indian” and “Chinese,” respectively.
A glance at historical context quickly reveals the motivation behind these two additions. For much of the early 19th century, many indigenous people had lived in their own sovereign nations, separate from the entity known as the United States. But as the decades wore on and settlers continued to encroach on native territories, American Indians were pushed—sometimes forcibly relocated—further north and west. By the latter part of the century, the American Indian Wars were winding down, and the government’s new policy was forced assimilation. Through a system of reservations, the U.S. absorbed the native tribes. And their members, now securely under the authority of the U.S. government, were enumerated by the census.
Meanwhile, immigration from China to California had begun with the 1848 Gold Rush. Chinese immigrants quickly became a source of cheap labor, especially as railroad workers in the 1860s. Some labor groups began to blame Chinese workers for keeping wages low, and racism against the Chinese surged in California. It was the age of Yellow Peril, so the Chinese had to be counted.
1890: Counting By Machine
Processing the 1880 census had taken eight years. Rates of immigration to the U.S. had increased, so the country’s population had certainly grown significantly; census officials feared that processing the 1890 Census results would take more than 10 years, meaning that their data would be obsolete by the time it was available. The solution: technology.
The 11th U.S. Census, administered in 1890, was the first to be processed by machine. Federal Marshals were no longer tasked with carrying out the survey; instead, trained enumerators visited each household. Data was recorded by hand, as before, and then entered on punched cards to be fed into a tabulating machine. Calculating everything took six years, two years less than the prior census had required. The machine’s inventor, Herbert Hollerith, later founded the Tabulating Machine Company, one of the companies that would be consolidated to form IBM.
More race categories were also added. The Reconstruction era had ended, and legislators continued to use racist scientific theories to justify Jim Crow laws and other discriminatory policies.
This time, black identity was sliced into four separate categories: black, mulatto, quadroon, octoroon. Enumerators were also given very specific directions for properly classifying people with black ancestry:
Be particularly careful to distinguish between blacks, mulattoes, quadroons, and octoroons. The word “black” should be used to describe those persons who have three-fourths or more black blood; “mulatto,” those persons who have from three-eighths to five-eighths black blood; “quadroon,” those persons who have one-fourth black blood; and “octoroon,” those persons who have one-eighth or any trace of black blood.
In prior censuses, all people of East Asian descent would have been categorized as “Chinese.” In 1882, in response to growing anti-Chinese sentiment, President Chester A. Arthur had signed the Chinese Exclusion Act, a federal law prohibiting the immigration of Chinese laborers. In the same decade, Japanese immigration to the United States had begun to accelerate. The Empire of Japan was determined to protect its expatriates and began a series of diplomatic maneuvers with the U.S. government; beginning in 1890, Japanese individuals were categorized as “Japanese” on the census.
1920: Politics of Whiteness
In the 15th United States Census, taken in 1930, “mulatto” was dropped as a racial classification. Instead, enumerators were instructed to record all mixed-race people with black heritage as “Negro,” unless that person was predominantly American Indian “and the status as an Indian is generally accepted in the community.” The “one-drop rule,” the idea that anyone with any black ancestry was automatically black, had been prevalent in the United States since the era of slavery; in the early 20th century, this idea made its way into laws like Virginia’s Racial Integrity Act of 1924, which prohibited interracial marriage. The new trend in scientific racism was eugenics, and lawmakers occupied themselves with creating policies that would keep white people white.
Meanwhile, the Philippines had become a territory of the United States, and a wave of Filipino immigration had begun; “Filipino” was added to the census in 1920. Other changes in immigration patterns had brought more new racial categories, including Korean and “Hindu,” which referred to all people from South Asia, regardless of their religion; many were actually Sikh.
In 1930, for the first—and only—time, “Mexican” was a racial classification. In 1848, following the end of the Mexican-American War, more than 70,000 Mexicans had suddenly found themselves citizens of the United States under the Treaty of Guadalupe Hidalgo. They had been categorized as “white,” possibly because the only other options were “black” and “mulatto.” Mexican Americans protested against being categorized separately, and by the next census, they were once again “white.”
It is critical to remember that racial classification is political—and that in the United States, there were tangible advantages in being considered white. Mexican Americans were not the only group that fought for whiteness. A Punjab-born U.S. Army veteran, Bhagat Singh Thind, had filed a petition for naturalization, which was restricted to “white persons.” Thind’s lawyers argued that his high-caste status and “disdain for inferiors” characterized him as white. In 1923, the Supreme Court ruled against Thind, finding that people of Indian descent were not white, and therefore could not marry white Americans in states with anti-miscegenation laws—or be granted citizenship through naturalization.
Experiments in Sampling
Some time between the authorization of the 1930 census and the beginning of the survey, the worst stock market crash in United States history accelerated the country’s descent into the Great Depression. In 1937, as the country continued to struggle, Congress ordered a special unemployment census: a form was mailed to every residential address. This was also an early exercise in statistical sampling: two percent of households received a separate survey, through which the Census Bureau’s statisticians could assess the accuracy of the larger census.
1940: Assessing the Housing Shortage
The Depression had seen millions of Americans living in homeless encampments. In 1940, the government saw the need to assess the status of housing in the United States: just how many housing units there were, and what sort of public housing programs were needed. Congress authorized a new national census of housing.
After having experimented with sampling in the preceding decade, the Census Bureau decided to create a “long form” questionnaire that targeted 5 percent of the population. By this method, they could obtain more detailed demographic information without making the survey long and burdensome for everyone. The “short form” population survey had 34 questions about basic biographical information like sex, age, and employment status; the “long form” survey contained 16 additional questions that went into greater detail.
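The idea of sending a longer questionnaire to a fixed fraction of respondents can be sketched in a few lines. This is a minimal illustration that assumes a simple random sample without replacement; the helper name, the identifiers, and the seed are made up, and it does not reproduce the Census Bureau’s actual 1940 selection procedure.

```python
import random


def draw_long_form_sample(respondent_ids, rate=0.05, seed=None):
    """Pick roughly `rate` of respondents to receive the long form.

    Hypothetical helper: a simple random sample without replacement,
    not the Census Bureau's actual 1940 selection procedure.
    """
    rng = random.Random(seed)
    ids = list(respondent_ids)
    sample_size = round(len(ids) * rate)
    return set(rng.sample(ids, sample_size))


respondents = range(1, 10_001)  # made-up respondent identifiers
long_form = draw_long_form_sample(respondents, rate=0.05, seed=1940)

for rid in (1, 2, 3):
    form = "long" if rid in long_form else "short"
    print(f"Respondent {rid}: {form} form")
```

Everyone still answers the short form; only the sampled 5 percent see the additional questions, and their answers are later scaled up to estimate figures for the whole population.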
Since the census’ inception, some respondents had had reservations about the government collecting their information: they worried about intrusions on individual privacy and about what would become of all the data obtained. In 1942, these fears were realized for one particular group of Americans: the results of the 1940 census were used to round up Japanese Americans for internment.
For decades, questionnaires had been filled out by enumerators. Racial classification was often done with a visual assessment of the person in question. In 1960, for the first time, Americans completed their own forms. The Census Bureau mailed questionnaires directly to homes, to be filled out and later collected by a visiting enumerator. Americans were now asked to identify their own race, rather than having an enumerator make the judgment.
Another first: the Census Bureau used FOSDIC (Film Optical Sensing Device for Input to Computers) to process the short-form questionnaires. Previously, clerks had to read filled-out forms and enter data onto punch cards; now, the filled circles were photographed onto microfilm, converted into data, and transferred onto magnetic tape.
Beginning in 1970, the longer form also included a question on Hispanic origins, kept separate from the question on race. People receiving the extra 29 questions could identify themselves as being of Mexican, Puerto Rican, Cuban, Central or South American, or “Other Spanish” descent while continuing to classify themselves as any race.
1980: Dealing with the Undercount
Census officials had always known that their enumeration of the population was imperfect. In 1790, the Census had counted 3,929,214 inhabitants of the United States, including about 700,000 slaves. Both George Washington and Thomas Jefferson are known to have doubted the accuracy of this figure, as they had imagined the population to be larger. Washington, in particular, blamed “the indolence of the people and the negligence of many of the Officers.” It is probable that there was, indeed, an undercount of the population. Much of the country, at the time, was wilderness. Roads and boundaries were irregular; traveling long distances by horseback was arduous. Isolated settlements may have been overlooked. Some Americans refused to participate: the office of the United States Marshal had been created only a year prior to the first Census, so people were unaccustomed to trusting the Marshals’ authority and suspected that answering their questions would only result in increased taxation.
Besides, the survey had always been administered by household, whether it had been a U.S. Marshal knocking on doors, or a postal worker delivering a questionnaire to each home. What about people without homes?
In 1980, the Census Bureau began to focus on reducing the undercount of homeless people. The Bureau instituted Mission Night, or M-Night, a one-night endeavor during which enumerators visited homeless shelters, bus and railroad stations, soup kitchens, and jails. A decade later, this effort was repeated as Shelter and Street Night, or S-Night. But the Bureau continued to struggle with assessing just how many homeless people they had failed to count.
But the homeless were not the only victims of the undercount. In 1988, a number of major cities, including New York, Los Angeles, and Chicago, filed suit against the Census Bureau, alleging that the Bureau had undercounted urban minorities. Indeed, the undercount of black people had been a known problem. In the 1940s, many more young men registered for the draft than census numbers had predicted, mostly because the survey had missed a disproportionate number of young black men; in response, the Census Bureau developed a system of measuring the undercount. Analysis of the results of the 1980 Census showed that the undercount rate for black people was, in fact, much higher than that of other races.
As a result of the New York lawsuit, a federal district court ordered the Census Bureau to adjust its numbers to compensate for the undercount; this decision was later stayed by the Supreme Court. Whether to adjust or not became the subject of political debate: an adjusted 1990 census would likely have increased Democratic representation in the House. Eventually, the unadjusted figures were used for redistricting.
2000: Entering the Modern Era
For four decades, the Census Bureau had used FOSDIC to read completed questionnaires. In 2000, FOSDIC was finally replaced by a new technology: optical character recognition, or OCR. Each returned questionnaire was scanned, and handwritten responses were converted into ASCII for processing.
While data processing technology moved forward, some technical decisions were still influenced by political and legal realities. To avoid a conflict with the 1996 Defense of Marriage Act, the Census Bureau implemented an automatic “edit procedure” when tabulating collected data. If a same-sex couple had reported themselves as married, the response was flagged as invalid, and the information was automatically changed so that they were recorded as “unmarried partners.”
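An edit rule of the kind described above is simple to express in code, which is part of why such decisions are easy to overlook. The sketch below is purely illustrative: the record layout, field names, and function are assumptions, not the Census Bureau’s actual schema or software.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CoupleRecord:
    """Hypothetical, simplified household record; not the Bureau's schema."""
    householder_sex: str
    partner_sex: str
    relationship: str      # e.g. "spouse" or "unmarried partner"
    edited: bool = False   # True when a reported answer was overwritten


def apply_same_sex_edit(record: CoupleRecord) -> CoupleRecord:
    """Recode a same-sex couple reported as married, as the text describes."""
    same_sex = record.householder_sex == record.partner_sex
    if same_sex and record.relationship == "spouse":
        return replace(record, relationship="unmarried partner", edited=True)
    return record


reported = CoupleRecord("F", "F", "spouse")
tabulated = apply_same_sex_edit(reported)
print(reported.relationship, "->", tabulated.relationship)
```

The `edited` flag is an addition for illustration: it shows how a pipeline could at least keep track of which tabulated values differ from what respondents actually reported.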
2010 and beyond
The government had always been concerned with keeping response rates high, whether by encouraging people to be honest with visiting Marshals, or to mail back a filled-out questionnaire. People receiving the long form survey tended to have a lower response rate. To combat this problem, the Census Bureau had begun to develop the American Community Survey, which would be administered separately from the Census. As a result, the 2010 Census questionnaire had only 10 questions; more detailed information would be drawn from the results of the American Community Survey.
The next census will be in 2020, and there has already been some talk about the modifications that will be made. The Census Bureau may try to combine the “Hispanic, Latino, or Spanish origin” question with the race question, to reduce confusion for people who consider their race to be Hispanic or Latino. The Bureau has also been exploring adding a category for people of Middle Eastern and North African descent, but the circumstances are complex. Recognition of a MENA ethnic category means acknowledging the existence and validity of Arab Americans, which could be meaningful, considering post-9/11 xenophobic rhetoric and Islamophobia in the United States. But historically, racial enumeration has been a means of both recognition and control; in 2004, concerns were raised when the Census Bureau gave information on Arab Americans to the Department of Homeland Security. At any rate, it’s too early to tell.
We have a tendency to look at numbers and statistics as some sort of incontrovertible source of truth, untainted by human prejudices and social myths. But data cannot be divorced from its context. The results of the U.S. Census have provided researchers with an incredible wealth of information, but it is crucial to remember that the people being counted did not necessarily fit into the categories provided.