R & Census

explore census data with R

Benford's Law and Census Data


Benford’s law attracted a lot of attention after the 2020 election, as some people attempted to use it to question the integrity of the election. Benford’s law states that the first digits of a large set of numerical data follow this distribution: 30.1% are 1s, 17.6% are 2s, 12.5% are 3s, …, 4.6% are 9s. It can therefore be used to detect data manipulation that causes a violation of the law.
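The percentages quoted above follow from the standard formula P(d) = log10(1 + 1/d). Here is a minimal sketch in R that reproduces them. The helper `first_digit_freq()` is a hypothetical name introduced only for illustration; it is not taken from the post.

```r
# Expected first-digit frequencies under Benford's law: P(d) = log10(1 + 1/d)
digits <- 1:9
benford <- log10(1 + 1 / digits)
round(100 * benford, 1)
# 30.1 17.6 12.5  9.7  7.9  6.7  5.8  5.1  4.6

# Hypothetical helper: observed first-digit frequencies of positive numbers
first_digit_freq <- function(x) {
  x <- x[is.finite(x) & x > 0]
  # In scientific notation the first character is the leading digit
  first <- as.integer(substr(sprintf("%.15e", x), 1, 1))
  as.numeric(table(factor(first, levels = 1:9))) / length(x)
}
```

Comparing `first_digit_freq(x)` with `benford` for a vector such as county populations gives a quick visual check. A formal comparison would need a goodness-of-fit test.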

Build a static website with R Shiny


Sounds stupid? Yes, it’s kind of throwing away 99% of Shiny’s power, and you can always build a static website with R Markdown, blogdown, or bookdown. Still, please keep reading, as it will save you time if you are an R user who wants to make a portfolio website to showcase your work, knows little about web development, and yet prefers designing your own interface to using a popular theme.

Navigate through Decennial Census and American Community Survey


Finding the right content in census data can be daunting. To give you an idea of how complex the census data are, there are 1127 tables and 25070 columns of table contents in the 2012-2017 ACS 5-year summary file alone.

| dataset | number of tables | number of columns |
|---|---|---|
| 2010 decennial census summary file 1 | 333 | 8959 |
| 2012-2017 5-year ACS summary file | 1127 | 25070 |
| 2017 1-year ACS summary file | 1372 | 33593 |

The complexity does not stop here.

Cities boomed and doomed


The population of the United States grew by 3.76%, from 313.91 million in 2012 to 325.72 million in 2017. The growth, however, is not geographically uniform. One geographic area of particular interest is the city. Some cities are shrinking in population while others are exploding. The purpose of this post is to find the cities whose populations decreased and increased the most. In addition, we will show the population of every city, and its change, on an interactive map.

Find Salamander -- Congressional District Gerrymandering


When Republicans or Democrats are in charge, they tend to draw election district boundaries in their favor. This can get very ugly, creating wacky boundaries through a practice called gerrymandering. The word “gerrymander” is a blend of two names: Massachusetts Governor Elbridge Gerry and the salamander, a mythological dragon-like monster. Governor Gerry signed a bill 200 years ago that drew a state senate election district in a shape resembling a salamander.

Map prisons in the United States


Although 1 in every 110 adults is locked behind bars in the United States, most people have no idea where the prisons are. This post helps you find the locations and numbers of inmates of federal prisons, state prisons, and local jails. The data are extracted from the 2010 decennial census. According to the census, a total of 2,263,602 people were incarcerated in the United States.

Is your hometown named after Lincoln or Washington?


Washington and Lincoln are the two greatest presidents in the history of the United States. This post, however, does not discuss how great they are. Instead, I want to show a fun fact related to the two presidents: how many cities and towns (and equivalents) are named after them. I live in Lincoln, Rhode Island, a town named in honor of President Lincoln. The data are extracted from the 2010 decennial census, the most detailed census so far.

Plot pie charts of racial composition in largest metro areas on a map in R


The pie chart has been criticized as a poor visualization and is not recommended in the R community. The popular ggplot2 package discourages the use of pie charts and has no dedicated geom_pie for them. Although the criticism is mostly valid, there is one case in which pie charts can be useful: pie charts on maps. Placing pie charts on a map is a compact way to show composition by location. A recent R package, scatterpie by Guangchuang Yu, specializes in making pie charts at multiple locations.

Determine relationship between census geographic entities with totalcensus package


This is an example application of the totalcensus package. The geographic hierarchy primer in the Census 2010 summary file 1 technical documentation displays the relationships between geographic entities. The lower of two entities connected by a line is entirely within the boundary of the upper one. For example, a county subdivision is always within the boundaries of a county, and a school district is always within the boundaries of a state.

Process 5-digit ZIP Code Tabulation Area (ZCTA5) data with totalcensus package


This is an example application of the totalcensus package. To install the package, follow the instructions at https://github.com/GL-Li/totalcensus. In this example, we examine data related to 5-digit ZIP Code Tabulation Areas (ZCTA5), using the 2016 ACS 5-year survey summary files. Let’s first load the libraries and then answer a couple of questions related to ZIP codes.

library(totalcensus)
library(data.table)
library(magrittr)
library(ggplot2)
library(ggmap)

What’s the population in a ZCTA5?

These population data can be easily obtained with the code below.

Build an R package for yourself


An R user can benefit a lot from building packages. I have read people writing about the benefits on various occasions and could not agree more after building my first package. We don’t have to be R developers to write packages. Developers write packages for others; we can just write packages for ourselves. As R users, we must have written functions and collected datasets, and may have used them across projects or may want to use them later on.

Extract US Census 2010 data with data.table and dplyr


This post explains how to extract information from the original dataset of the US 2010 census summary file 1 with urban/rural update, using the data.table or dplyr package in R. Why do we want to work with the original data, you may ask, when there are already R packages such as UScensus2010, censusapi, and tidycensus that help users get the data? The biggest benefit is that you will have full access to all of the 2010 census data.

ggplot2: place text at the right location


A common task in plotting is adding text as labels or annotations at specific locations. ggplot2 has the functions geom_text(), geom_label(), and annotate() for this purpose. In this post we discuss how ggplot2 controls the positioning of text. First we need to specify the (x, y) coordinate in the plot where the text is placed. By default, the center of the text is at (x, y), which is sometimes not what we want, as shown in the example below.
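As a rough illustration of the default centering and one way to change it, here is a minimal sketch with made-up data. It uses the `hjust` and `vjust` parameters of `geom_text()`. The data frame and its values are assumptions for demonstration only, not the post's own example.

```r
library(ggplot2)

# Toy data, invented for illustration
df <- data.frame(x = c(1, 2, 3), y = c(1, 2, 3),
                 label = c("first", "second", "third"))

# Default: each label is centered on its (x, y) point and overlaps the marker
p_default <- ggplot(df, aes(x, y)) +
  geom_point() +
  geom_text(aes(label = label))

# hjust = 0 starts the text at x; vjust = 0 rests the bottom of the text on y
p_shifted <- ggplot(df, aes(x, y)) +
  geom_point() +
  geom_text(aes(label = label), hjust = 0, vjust = 0)
```

Printing `p_default` and `p_shifted` side by side shows the labels moving from on top of the points to their upper right.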

ggplot2: aes(group = ...) overrides default grouping


Default grouping in ggplot2

ggplot2 can split the data into groups and give each group its own appearance and transformation. In many cases new users are not aware that default groups have been created, and are surprised by unexpected plots. There are two ways in which ggplot2 creates groups implicitly: If x or y is a categorical variable, the rows with the same level form a group. Users often overlook this type of default grouping.
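The effect of a categorical x on grouping, and the override named in the title, can be sketched with toy data. The data frame below is invented for illustration and is not from the post.

```r
library(ggplot2)

# Toy data with a categorical x variable
df <- data.frame(
  month = factor(c("Jan", "Feb", "Mar"), levels = c("Jan", "Feb", "Mar")),
  sales = c(10, 14, 9)
)

# Default: each level of month forms its own group, so geom_line()
# has only one point per group and draws no connecting line
p_default <- ggplot(df, aes(month, sales)) +
  geom_line()

# aes(group = 1) puts all rows into one group, so the points are connected
p_grouped <- ggplot(df, aes(month, sales, group = 1)) +
  geom_line()
```

Inspecting `layer_data(p_default)$group` and `layer_data(p_grouped)$group` shows three groups in the first plot and a single group in the second.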

My uniform way of using ggplot2


I just finished reading Hadley Wickham’s ggplot2 book (second edition). Before that, I had been using ggplot2 for a couple of years, having learned it mainly by reading documentation and searching for help online. The ggplot2 package is a powerful and comprehensive tool for generating static plots, but I also feel it is a little too flexible; the same plot can be made in many different ways. This flexibility provides obvious convenience but also introduces a lot of confusion and an extra burden of memorization.

Compare data.table pipes and magrittr pipes


We have two ways to chain data.table operations, using data.table pipes or using magrittr pipes. For a data.table dt, the data.table pipes take the form of dt[][][]... and magrittr pipes dt %>% .[] %>% .[] %>% .... Let’s first compare the readability of the two pipes in the following examples. Hadley Wickham criticized the readability of data.table pipes in this stackoverflow post. The data.table pipes, however, are not that hard to follow for those who are familiar with data.
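To make the two forms concrete, here is a minimal sketch that runs the same three steps both ways on a toy table. The column names and values are assumptions chosen for illustration and are not the post's own examples.

```r
library(data.table)
library(magrittr)

# Toy data, invented for illustration
dt <- data.table(group = c("a", "a", "b", "b"), value = c(1, 2, 3, 4))

# data.table pipes: chain operations with consecutive brackets
res_chain <- dt[value > 1][, .(total = sum(value)), by = group][order(-total)]

# magrittr pipes: the same steps, one per line, using . as a placeholder
res_pipe <- dt %>%
  .[value > 1] %>%
  .[, .(total = sum(value)), by = group] %>%
  .[order(-total)]
```

Both forms should give the same result; the difference is only in how the steps are laid out on the page.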