After seeing an announcement about it on R-Bloggers, I decided to test out the new
xray package using the Titanic data set. The
xray package provides a few functions for quickly getting a summary of anomalies and distributions of the variables in a data set. For anomalies, the
anomalies() function outputs the number and percentage of NA’s, zeroes, empty strings, and infinities while also giving some useful information about distinct observations and variable class. For distributions, the
distributions() function outputs boxplots and histograms of each variable. For numerical variables, the histograms also have a density curve overlay. These plots give a good starting point for further analysis.
I used the Titanic data set hosted by Kaggle to play around with the
xray package. Below is a partial reproduction of my Kaggle kernel showing an exploratory data analysis of the Titanic data set using
xray. For the full analysis, see the linked kernel.
Titanic EDA Using
To get a sense for the Titanic data set, we are going to do some exploratory data analysis. As part of that, we will use the
xray package to check for anomalies and look at distributions.
library(tidyverse) library(xray) library(knitr)
First, we need to import the training.
training <- read_csv("posts_data/train.csv", progress = FALSE)
Now, let’s take a look at the data using
## Observations: 891 ## Variables: 12 ## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,... ## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,... ## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,... ## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra... ## $ Sex <chr> "male", "female", "female", "female", "male", "mal... ## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ... ## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,... ## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,... ## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138... ## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ... ## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, ... ## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ...
We have 12 variables that we will want to look at. According to the data set description from Kaggle, the variables are defined in the following way:
- PassengerId – A numerical id for each passenger.
- Survived – The response variable as a factor showing survival.
- Pclass – The ticket class of the passenger used as a proxy for socio-economic class. This variable is a factor with levels 1 = 1st (Upper Class), 2 = 2nd (Middle Class), and 3 = 3rd (Lower Class).
- Name – The full name of the passenger with title. Married female passengers appear to have their partner’s name listed with their actual name in parentheses.
- Sex – A factor showing the sex of the passenger.
- Age – The age of the passenger. Estimated ages are given as xx.#.
- SibSp – The number of siblings and spouses the passnger has aboard.
- Parch – The number of parent/children relationships that the passneger has aboard.
- Ticket – Ticket number of the passenger.
- Fare – The passenger fare.
- Cabin – The passenge cabin number.
- Embarked – The port of embarkation. This is a factor with three levels C = Cherbourg, Q = Queenstown, S = Southampton.
Before we begin the exploration, let’s recode the factors with proper levels.
training <- training %>% mutate(Survived = factor(Survived, levels = c(1, 0), labels = c("Yes", "No"))) %>% mutate(Pclass = factor(Pclass, levels = c(3, 2, 1), labels = c("3rd", "2nd", "1st"), ordered = TRUE)) %>% mutate(Sex = factor(Sex, levels = c("male", "female"))) %>% mutate(Embarked = factor(Embarked, levels = c("C", "Q", "S"), labels = c("Cherbourg", "Queenstown", "Southampton")))
Checking Anomalies with
Now that the data sets have been re-coded with factors, let’s take a look for any anomalies in the data set using the
xray package. The
xray package gives access to several functions for performing exploratory data analysis. The first one we will use is the
anomalies() function that returns a data frame summarizing the variables in a provided data frame.
The function returns summaries for each variable showing:
- The number of observations
- The number and percentage of observations that are NA
- The number and percentage of observations that are zero
- The number and percentage of observations that are blank
- The number and percentage of observations that are infinite
- The number of distinct observations
- The variable type
- The percentage of observations that are “anomalous”
This information can give us a starting point to examine the data set and see what variables will be useful for training our models.
xray::anomalies(training)$variables %>% knitr::kable()
From calculating the anomalies on the training data, we can see that most of the entries in the Cabin variable are NA. There are two options with this entry as we can either remove the Cabin variable entirely or recode all of the NA’s as “No Cabin” or some similar value. We will investigate the other two variables with NA’s below to see if we can turn them into something useful.
There are also three variables that show a number of zero observations. For the Parch and SibSp variables, this should be fine. This just means that those passengers are traveling without family members. For the fare, we may want to investigate which of the passengers have zero fare as they may be crew members or have some other identifying information.
One interesting point that pops out from the anomalies function is that the Ticket variable has only 681 distinct observations. This means that there must be a few repeat tickets that bears a quick investigation.
Note: As of this writing, the
xray does not currently support ordered factors. This is why the
Pclass variable has a class of “Unknown” from the anomalies() function. This will also cause a problem with the distributions() function so we will remove the ordering for that. This has been fixed in the current iteration of the package available on Github.
Next, we will use the
distributions() function from the
xray package to visually inspect the variables. This function returns a series of box plots (for categorical variables) and histograms (for numerical variables) along with giving some distributions and breakdowns for the numerical variables.
For the bar plots, the axis labeled as “Rows” is actually the number of observations of each variable category. For the histograms, the number of observations has been scaled for the associated density plot.
training %>% select(-PassengerId, -Name) %>% mutate(Pclass = factor(Pclass, ordered = FALSE)) %>% xray::distributions()
## Variable p_1 p_10 p_25 p_50 p_75 p_90 p_99 ## 1 Parch 0 0 0 0 0 2 4 ## 2 SibSp 0 0 0 0 1 1 5 ## 3 Age 1 14 20.125 28 38 50 65.87 ## 4 Fare 0 7.55 7.9104 14.4542 31 77.9583 249.0062
distributions() function gives us a very rough visual tool for inspecting the variables. The function allows us to see some general trends and places that could use further checking. For instance, with this data set, it is clear that the fare is being skewed by large outliers. Both the cabin and ticket variables have many categories with low counts that make them hard to visualize with a bar plot. However, the
distributions() function works great for the other variables as a means of quick visual analysis.
The functions in the
xray package provide a simple way to quickly glean some insight from unknown data sets. By letting the user pull out information on the anomalies and distributions of variables, the user is able to easily compare the variables and determine which ones require further investigation. It is a wonderful stepping off point for deeper analysis and I look forward to see the package develop further.