It is often said that the "hardest part of data science is getting good, clean data." However, this statement might oversimplify the challenge. Sometimes the data you need to solve a problem doesn't exist and you must either collect it yourself or develop a creative set of proxies instead. While such methods are inherently imperfect, their value must be judged against the alternative. As the adage goes: "essentially, all models are wrong, but some are useful."
Earlier this year, the City of New Orleans approached us with such a problem. After a particularly tragic incident in which a fire killed five people (including three children) in a house without a working smoke alarm, the city's analytics team wondered if it could use public data to do better. The challenge, as they described it to us, was to estimate the probability that homes on any given block lacked working smoke alarms. With these estimates, the city could direct its fire prevention efforts more intelligently and, in so doing, save lives.
Unfortunately, there was no clear data source to start with. While many cities and nonprofit organizations like the Red Cross conduct door-to-door fire safety inspections, they do not systematically collect data on every home they visit. So while they may keep track of the residences in which they install smoke alarms, they very often don't track information on those which already have working alarms. Without a statistically unbiased sample, we had to come up with a creative solution.
Our approach was to use the American Housing Survey (AHS), a biennial panel survey conducted by the Census Bureau which tracks highly specific details about households. One of the questions it asked in 2011 was whether residents had a working smoke alarm. Unfortunately, for our use case the AHS alone was not enough: at best it enabled us to develop a model which indicated which cities were most at risk. However, by systematically mapping variables from the AHS onto the American Community Survey (ACS), we were able to create a dataset which could support predictions at the level of a census block group. You can read more about this process here.
Exploring the Data
Before we could develop our model, we first needed to get a sense of what we could reliably infer from the data. An initial issue we had to contend with was missing data. Since the AHS does not require that subjects respond to all questions, many variables were sparsely populated and unusable:
The plot above displays each variable in the AHS (y-axis) by the percentage of missing records (x-axis). As you can see, certain variables like pvalue (a home’s property value) and zincn (the household’s overall income) are missing in almost every record. As a result, we chose to remove these variables from our model since we could not be sure of their bias. In the end, we removed all variables with more than 50% missing records. Missing data for all remaining variables were dealt with by sampling from their existing distribution. In other words, when a respondent neglected to answer a given question, we assigned them a response based on how all other households answered. This way we could be sure that the variance of each variable was not affected.
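The two cleaning steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the column `built` and all of the values are invented, missing answers are represented as `None`, and the helper names are hypothetical rather than taken from our code.

```python
import random

def missing_share(values):
    """Fraction of entries that are None (unanswered)."""
    return sum(v is None for v in values) / len(values)

def drop_sparse_columns(table, max_missing=0.5):
    """Keep only columns whose missing share does not exceed max_missing."""
    return {name: vals for name, vals in table.items()
            if missing_share(vals) <= max_missing}

def impute_by_sampling(values, rng):
    """Replace each None with a random draw from the observed answers."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

# Toy stand-in for AHS columns; every value here is invented.
table = {
    "pvalue": [None, None, None, 120000, None],  # 80% missing, so dropped
    "built": [1950, None, 1987, 2003, 1950],     # 20% missing, so kept
}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
kept = drop_sparse_columns(table)
cleaned = {name: impute_by_sampling(vals, rng) for name, vals in kept.items()}
```

Because each gap is filled with a draw from the answers other households actually gave, frequent answers are chosen more often than rare ones, which is what keeps the filled-in column close to its original distribution.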
Constructing the Model
To develop the risk model, we used all variables that passed the 50% non-missing data threshold to make a binary classification of whether or not a respondent in the AHS had a smoke alarm. To make our classifications, we used a model called a Random Forest, which creates an ensemble of many different decision trees. This method is preferable since it allows us to use more predictors without worrying too much about overfitting, that is, creating a model which seems very strong on the training data but is wildly inaccurate when dealing with new data.
A challenge in the modeling process was dealing with the issue of so-called "rare events." In the AHS, only 4% of households indicated that they lacked a working smoke alarm. As a result, if we were to simply guess that every household had a working smoke alarm, we would be correct 96% of the time. However, we needed a way to maximize our true positive rate (the percentage of households lacking a working smoke alarm that our model correctly identifies) while still keeping the overall error rate relatively low.
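To make the rare-events problem concrete, the sketch below scores the "everyone has an alarm" guess on 100 hypothetical households that match the 4% figure. The `rates` helper is an illustrative name, not part of our code, and it assumes at least one household without an alarm is present.

```python
def rates(actual, predicted):
    """Return (accuracy, true_positive_rate); label 1 means 'no working smoke alarm'."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    positives = sum(a == 1 for a, _ in pairs)
    true_pos = sum(a == 1 and p == 1 for a, p in pairs)
    return correct / len(pairs), true_pos / positives

# 100 hypothetical households: 4 lack a working alarm (1), 96 have one (0).
actual = [1] * 4 + [0] * 96
always_has_alarm = [0] * 100  # the naive guess for every household

accuracy, tpr = rates(actual, always_has_alarm)
```

Here `accuracy` comes out at 0.96 while `tpr` is 0.0: the naive guess looks excellent on overall accuracy yet never finds a single household that needs an alarm.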
To address this challenge, we employed a technique called oversampling in which we trained the model on a random subset of the data which had a more balanced distribution between residents with and without smoke alarms. To determine the optimum weight to oversample, we trained multiple models at different values of this weight and compared how it affected the true positive and overall error rates.
From this analysis we found that oversampling households without smoke alarms by a factor of 21 would result in a model which was equally capable of predicting whether a household didn't have a smoke alarm as it was of predicting whether it did.
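One simple way to read "oversampling by a factor" is to resample the rare class with replacement until it is that many times its original size. The sketch below does exactly that; it is an assumption about the mechanics rather than a transcription of our code, and the record layout (`no_alarm`) is invented.

```python
import random

def oversample(records, is_minority, factor, rng):
    """Resample minority rows with replacement to `factor` times their count,
    then return them shuffled together with the untouched majority rows."""
    minority = [r for r in records if is_minority(r)]
    majority = [r for r in records if not is_minority(r)]
    boosted = [rng.choice(minority) for _ in range(factor * len(minority))]
    combined = majority + boosted
    rng.shuffle(combined)
    return combined

# 100 hypothetical training rows with the 4% minority share from the AHS.
records = [{"no_alarm": True}] * 4 + [{"no_alarm": False}] * 96
rng = random.Random(0)
balanced = oversample(records, lambda r: r["no_alarm"], factor=21, rng=rng)

minority_rows = sum(r["no_alarm"] for r in balanced)  # 4 * 21 = 84
minority_share = minority_rows / len(balanced)        # 84 / 180
```

With a 4% starting share, a factor of 21 yields 84 minority rows against 96 majority rows, or roughly 47% of the resampled set. Resampling of this kind would normally be applied to the training rows only, so that held-out accuracy is still measured on the original class balance.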
Ultimately, our model's overall accuracy rate is 62%, and it maintains a true positive rate of 59%. If we consider our baseline true positive rate to be 4% (as in, if we were to randomly guess whether someone didn't have a smoke alarm in the AHS, we would only be correct 4% of the time), then our model represents a roughly 15-fold improvement over the alternative. The model also indicates that race, the age of a house, poverty, and education level seem to be most strongly correlated with not having a smoke alarm on a national level.
Another concern in constructing the model was ensuring that we accounted for local variations. While we may be fairly confident of our predictions at the national level, individual cities or even neighborhoods may exhibit very different features. For instance, many well-educated, affluent residents of New York City choose to remove or turn off their smoke alarms because their stoves are not properly ventilated. Our national model would be unable to capture such variations. In statistics, this phenomenon is known as Simpson's Paradox, which describes a situation where a relationship appears in subsets of data but reverses or disappears when these subsets are merged.
To address this issue, we also created identical models for each Metropolitan Statistical Area (MSA) in the AHS. MSAs are statistical units created by the Census which comprise cities and their surrounding suburban areas. While the AHS contains over 200 distinct MSAs, not all of these have the same coverage.
Unfortunately, most MSAs had fewer than 100 records in the AHS. Among these there was also substantial variation in the percentage of households without smoke alarms. To ensure that we only developed local models for MSAs with sample sizes large enough to make unbiased predictions, we chose the top 29, each of which had more than 2,000 respondents. As we might expect, these models were less accurate than our national-level model. On average, the overall accuracy of the MSA models was still 62%, but the true positive rate dropped to 52%.
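The sample-size filter amounts to counting records per MSA and keeping those above the cutoff. A minimal sketch, with invented MSA codes and counts and a hypothetical helper name:

```python
from collections import Counter

def msas_with_enough_records(msa_codes, min_records=2000):
    """Return the set of MSA codes that appear more than min_records times."""
    counts = Counter(msa_codes)
    return {code for code, n in counts.items() if n > min_records}

# One entry per survey record; codes and counts are purely illustrative.
msa_codes = ["A"] * 2500 + ["B"] * 90 + ["C"] * 2001
eligible = msas_with_enough_records(msa_codes)
```

In this toy input only "A" and "C" clear the threshold, so only they would get a local model; records from "B" would still be covered by the national model.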
Generating the Scores
Armed with our models, we set about applying them to ACS data to generate risk scores for each census block group. Most of the difficulty in this task was addressed through our systematic merge of the AHS and the ACS described above. By applying our models to this dataset, we had a means of predicting, nationwide, which blocks are more likely to contain residents without working smoke alarms. These scores fall in a range between zero and one, with one representing absolute certainty that the block contains residents without smoke alarms. We also created a second score for each block in each of the 29 MSAs for which we created separate models. The final risk score is an average of these two scores.
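The averaging step can be written as a small function. One assumption is labeled here because the text does not spell it out: for blocks outside the 29 modeled MSAs, this sketch simply falls back to the national score. The function name is hypothetical.

```python
def final_smoke_alarm_risk(national_score, msa_score=None):
    """Average the national and MSA scores when both exist.

    Assumption: a block with no MSA-level model keeps its national score.
    """
    if msa_score is None:
        return national_score
    return (national_score + msa_score) / 2

inside_modeled_msa = final_smoke_alarm_risk(0.4, 0.6)  # mean of the two scores
outside_modeled_msa = final_smoke_alarm_risk(0.4)      # national score only
```

Since both inputs lie between zero and one, the averaged score stays in that range as well.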
In New Orleans we also took the additional step of combining our predictions with two other census block-level indicators:
- population_risk : the percentage of residents who are under the age of 5 or over the age of 65
- home_fire_risk : the number of residential fires per capita.
These indicators are normalized to a scale of zero to one and then combined with our risk scores according to this simple equation:
overall_risk = (smoke_alarm_risk + (population_risk * 0.34 + home_fire_risk * 0.66)) / 2
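The equation translates directly into code. The sketch below also includes a min-max rescaling step, which is an assumption on our part here: the text says only that the indicators are normalized to a zero-to-one scale, not which method is used. The sample fire counts are invented.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1.

    Assumption: min-max scaling; the normalization method is not specified above.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no spread, so no block stands out
    return [(v - lo) / (hi - lo) for v in values]

def overall_risk(smoke_alarm_risk, population_risk, home_fire_risk):
    """Combine the three zero-to-one scores using the weights in the equation above."""
    return (smoke_alarm_risk + (population_risk * 0.34 + home_fire_risk * 0.66)) / 2

# Invented per-capita fire counts for three blocks, rescaled to [0, 1].
fire_scores = min_max_normalize([2, 4, 6])
example = overall_risk(smoke_alarm_risk=0.5, population_risk=1.0,
                       home_fire_risk=fire_scores[1])
```

Because the two indicator weights sum to one, the bracketed term stays between zero and one, and so does the overall score. Half of the final value comes from the smoke alarm prediction and half from the weighted indicators.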
Since the ultimate goal is to prevent death and injuries from home fires, our reasoning was that we should target not just residents without smoke alarms, but also those areas and populations most at risk of injury or death in a home fire. But while data about the age of residents is widely available from the Census, incident-level fire data is harder to come by. The National Fire Incident Reporting System is a promising source, but the data is self-reported and not geolocated. In releasing this model to the public, we're also encouraging cities to upload their own fire incident data to rebalance the focus of attention on areas with a history of having fires. This helps us further hone the scores towards the areas of highest perceived need.
Smoke Signals is a first step in using data-driven decision-making to help reduce fire deaths. However, it is still a work in progress. The addition of data about where smoke alarms are known to be installed offers a promising direction for continuing to refine the precision of the model. We hope that city governments and civic hackers alike will join us in this effort. If you're interested in digging into the analysis, you can see our code and documented process on GitHub.