数据转换代写 Prac 3 – Data Transformation & Exploration
数据转换代写 Instructions. In this practical, you will walk through some examples on data transformation, exploration and imputation. You will…
Instructions. In this practical, you will walk through some examples on data transformation, exploration and imputation. You will need to solve some problems following the examples. Place your answers in the cell below the problems, and you may add new cells if needed. Make sure your code is included in your answers.
You may find it helpful to read through the whole notebook and learn from the examples first, before solving the problems.
1 Part 1 Data Transformation 数据转换代写
In computer programming, a data type of a variable determines what type of values it can contain and what operations can be performed. The following table summarises some of the common data types we will come across in R:
Data Type | Example | R Code |
Logical | TRUE, FALSE | v <- TRUE |
Numeric | 12.3, 5, 999 | v <- 23.5 |
Integer | 1L, 34L, 0L | v <- 2L |
Character | “a”, “good”, “TRUE”, ’23.4’ | v <- “TRUE” |
You can use the class(var) syntax to determine what object type a variable is, as demon- strated in the following example:
In [ ]: char_var <- "12345" num_var <- 12345 class(char_var) class(num_var)
Having a dataset that contains inconsistent data types is a common data cleaning problem. The above example demonstrates two different ways the number 12345 could be expressed in a dataset, as a character or a numeric value. The data type of the variable determines what sort of operations can be performed on them. 数据转换代写
Datasets that are interpreted as the wrong data type, or that are inconsistent need to be cleaned so that the desired operations can be performed on the dataset. If a column that is supposed to contain integers contains characters, for example, we can no longer run numeric functions such as mean() on those values:
In [ ]: a <- c(15, 20, 1, 10) b <- c("15", "1", "4") # Valid! We can compute the mean of numberic values. print(mean(a)) # You can't compute the mean from a set of characters by a number. print(mean(b))
Luckily, R contains built-in functions designed to convert between data-types. If we use the as.numeric() function, we can attempt to convert a character to a numeric type. Let’s try the above example again, making sure we convert the characters to a numeric value before attempting to run the mean function:
In [ ]: b <- c("15", "1", "4") mean(as.numeric(b))
Converting between data-types is a common feature of data-cleaning and transformation. Let’s demonstrate by cleaning an example dataset.
This dataset is a modified version of food-borne gastrointestinal illness in the US in 1940. The data has been modified from the original to include Brisbane-based addresses and some more obvious data integrity issues have been injected.
In [ ]: library(readr) oswego <- read_csv("OswegoTutorial.csv")
Our first step might be to explore the data and get a feel for the type of data we are working with. We can use the head() function to have a look at the first few rows of the data:
In [ ]: head(oswego)
If we want to determine how large the dataset is, we can use the dim() function to determine the dimensions:
In [ ]: dim(oswego)
Let’s have a closer look at the “age” column: 数据转换代写
In [ ]: oswego$age
Do you notice anything odd about this row? It looks like there was a data input error, and the number 7 has been inserted as the word “seven”. Let’s see how this affects our data analysis by trying to run the mean() function on the age column to determine the average age of people in our dataset.
In [ ]: mean(oswego$age)
As you can see, we get an error, saying that the “argument is not numeric or logical”. It looks like we can only run the mean function on a column that is numeric or logical. What data type is the age column?
In [ ]: class(oswego$age)
We can see that R has interpreted this column as the ’character’ data type, which explains why we can’t run the mean function on it. Let’s first replace the character(s) “seven” with “7” so that we can easily convert the whole column to a numeric data type.
In [ ]: oswego$age[oswego$age == "seven"] <- "7"
Let’s break down the above query a little bit. The oswego$age == “seven” is what is called a conditional statement, it matches values in oswego$age according to a specific condition. In this case, we match any of the rows that have the value “seven”. We then set any of these rows to the character “7” instead.
| TASK
In [ ]: # [Place your Answer here]
You may have noticed that in our oswego dataset, we have the full address of each patient. Can you think of how this may be useful?
One use of this locational data might be to see if outbreaks are clustered around particular sub- urbs/areas. In this case, being able to query the postcode directly would be useful, but currently the postcode is within the ’Address’ column, which has the format:
38 Jones Road, South Brisbane, QLD 4101
One common aspect of preparing your data for use is making sure that it is in a format that is as simple to query as possible. For example, in the above case, if we wanted to find out what the most common suburb is which contained an outbreak, we would not be able to easily query suburb specifically with the address column as it contains a lot of superflous information.
One solution might be to transform the data so each part of the address is in its own column – that way we query against a much simpler attribute such as postcode.
Let’s demonstrate this by splitting up the “onsetdate” column first as an example. You can see that the onsetdate is in the format:
19-Apr
If we wanted to query directly by month, we could separate the “Month” part of the address directly into its own column. Let’s do that. We will use the “tidyr” library in R, a popular data transformation library. Run the code below and try and understand what it is doing.
In [ ]: library(tidyr)
oswego_new <- separate(oswego, onsetdate, into = c("onset_day", "onset_month"), sep = head(oswego_new)
| TASK
In [ ]: # [Place your Answer here]
Now that we’ve separated the columns, we can directly query “month” to see the range of months that these outbreaks occurred.
In [ ]: unique(oswego_new$onset_month)
We can see that this data is limited to April, June and also that a lot of the onset date data is missing, resulting in ’NA’ values. Now let’s see how we can get useful information from the location information.
| TASK
In [ ]: # [Place your Answer here]
| TASK
In [ ]: # [Place your Answer here]
|
Extension TASK
In [ ]: # [Place your Answer here]
| TASK
In [ ]: # [Place your Answer here]
2 Part 2 Data Exploration
In Prac. 1, we first encountered the HR Analytics dataset.
The question we seek to answer is: Why are our best and most experienced employees leaving?
To get to grips with the data, we will carry out some exploratory data analysis (EDA) tech- niques in R.
Firstly, let’s import the data and look at a few rows.
In [ ]: library(readr) HR_comma_sep <- read_csv("https://stluc.manta.uqcloud.net/mdatascience/public/datasets HR_comma_sep
| TASK
What type of variable is left?
In [ ]: # [Place your Answer here]
Now, let’s count the number of rows with missing data.
In [ ]: sum(!complete.cases(HR_comma_sep))
Since there is no missing data, we can proceed with our analysis on the complete data set. Let’s explore some simple quantitative summaries. The default summary statistics are as fol-lows.
In [ ]: summary(HR_comma_sep)
These statistics tell us a bit about each variable in isolation; what we would like to do is obtain statistics pertinent to our question.
To proceed, it is useful to classify each variable as either a response or a predictor. For this problem and data set, it is clear that left is the response and all other variables are predictors.
For example: What is the breakdown of left by job Department (sales)?
This is easily achieved by creating a two-way contingency table, which counts the number of intances in each left by sales cell in a two by two table.
In [ ]: test_table<-table(HR_comma_sep$sales,HR_comma_sep$left) test_table
Marginal counts and proportions are easily created as follows. 数据转换代写
In [ ]: margin.table(test_table) margin.table(test_table,1) margin.table(test_table,2) prop.table(test_table) prop.table(test_table,1) prop.table(test_table,2)
These proportions can also be conveniently visualised using a mosaic plot, where the widths of rectangles are proportional to the number of observations in the x variable categories (Department), and, for each x variable category, the heights are proportional to the number of observations in the corresponding y variable categories (Left).
In [ ]: mosaicplot(test_table,main="Mosaic Plot of Left by Department",xlab="Department",ylab=
| TASK 数据转换代写
In [ ]: # [Place your Answer here]
How about the breakdown of average hours worked per month by left?
We will first create visual summaries using boxplots (which display summary statistics visu- ally).
In [ ]: boxplot(HR_comma_sep$average_montly_hours~HR_comma_sep$left,xlab="Left",ylab="Average
This simple visual summary suggests that employees that left typically worked longer hours. We can get a more detailed view by constructing a histogram as follows.
In [ ]: hist(HR_comma_sep$average_montly_hours[HR_comma_sep$left==1], col=rgb(1,0,0,0.5),main= hist(HR_comma_sep$average_montly_hours[HR_comma_sep$left==0], col=rgb(0,0,1,0.5),add=T legend("topright", c("Not Left","Left"),fill=c(rgb(0,0,1,0.5), rgb(1,0,0,0.5)))
The histogram reveals an interesting feature of the data; the distribution of average monthly hours for employees that left is multimodal. This suggests employees that leave fall into two groups: those that work normal hours and those that work long hours. 数据转换代写
The same information can be seen from a plot of the empirical cumulative distribution function (ecdf) for average monthly hours by left.
In [ ]: E0<-ecdf(HR_comma_sep$average_montly_hours[HR_comma_sep$left==0]) E1<-ecdf(HR_comma_sep$average_montly_hours[HR_comma_sep$left==1]) plot(E0,col=rgb(0,0,1,0.5),verticals = TRUE, do.points = FALSE,main="ECDF for Average plot(E1,col=rgb(1,0,0,0.5),verticals = TRUE, do.points = FALSE,add=TRUE) legend("bottomright", c("Not Left","Left"),lwd=1, lty=c(1,1),col=c(rgb(0,0,1,0.5), rgb
We can also look at the time spent in the company by left.
In [ ]: par(mfrow=c(2,1)) # Create 2 by 1 figure # Plot Frequencies hist(HR_comma_sep$time_spend_company[HR_comma_sep$left==0],breaks=c(0.5,1.5,2.5,3.5,4. hist(HR_comma_sep$time_spend_company[HR_comma_sep$left==1],breaks=c(0.5,1.5,2.5,3.5,4. legend("topright", c("Not Left","Left"),fill=c(rgb(0,0,1,0.5), rgb(1,0,0,0.5)))
# Plot Proportions hist(HR_comma_sep$time_spend_company[HR_comma_sep$left==1],breaks=c(0.5,1.5,2.5,3.5,4. hist(HR_comma_sep$time_spend_company[HR_comma_sep$left==0],breaks=c(0.5,1.5,2.5,3.5,4. legend("topright", c("Not Left","Left"),fill=c(rgb(0,0,1,0.5), rgb(1,0,0,0.5)))
We see from the above plots that the densities (Density) are easier than the frequency counts to visually compare, although relative magnitude information is lost in the process.
The shape of the distributions for time spent at the company are noticably different, withem- ployees that have been at the company for very short of very long periods of time being less likely to leave. Most employees that leave the company have worked there for between three and five years, inclusive.
| TASK 数据转换代写
In [ ]: # [Place your Answer here]
| TASK
In [ ]: # [Place your Answer here] Suppose we wish to plot last evaluation and average monthly hours by leave. One way to do this is with a scatter plot. In [ ]: plot(HR_comma_sep$average_montly_hours[HR_comma_sep$left==0],HR_comma_sep$last_evaluat points(HR_comma_sep$average_montly_ho rs[HR_comma_sep$left==1],HR_comma_sep$last_evalu legend("bottomright", c("Not Left","Left"),pch=c(1,1),col=c(rgb(0,0,1,0.5), rgb(1,0,0,
From the scatter plot, we gain additional insight into the relationship between last evaluation, average monthly hours, and those that left. The bulk of those leaving that worked normal hours also received low evaluations, and most of those leaving that worked long hours received high evaluations.
We can break down the scatterplot further by an additional variable with the construction of a scatter plot matrix. For instance, we might be curious how satisfaction level, last evaluation, average monthly hours, and left relate to eachother.
In [ ]: library("lattice") #Load the `Lattice' graphics package
HR_subset<-subset(as.data.frame(HR_comma_sep),select=c('average_montly_hours', 'last_e super.sym <- trellis.par.get("superpose.symbol") #Get Symbol Plotting Information splom(~HR_subset[1:3],groups=left,data = HR_subset,varnames=c("Average \nMonthly \nHou key = list(title = "Scatterplot Matrix", columns = 2, points = list(pch = super.sym$pch[1:2], col = super.sym$col[1:2]), text = list(c("Not Left", "Left"))))
| TASK
In [ ]: # [Place your Answer here]
So far, we have been exploring the entire data set. Let us return to the original question: Why are our best and most experienced employees leaving?
To get to grips with this, we need to identify which subset of employees are “best” and “most ex- perienced”. Precisely what this means to any particular person is ambiguous. When encountering ambiguity in the problem, the process of resolving that ambiguity involves a two-way dialogue with the problem poser.
Broadly, one might imagine this subset to contain: 数据转换代写
- Employeeswith high
- Employeesthe have been with the company for a
Additional criteria might be:
- Employeesthat work on a large number of
- Employeesthat work a
For now, suppose that the “best” employees are those with an evaluation of 0.8 or higher, and the “most experienced” employees are those that have been with the company for 4 or more years.
| TASK
In [ ]: # [Place your Answer here]
| TASK
In [ ]: # [Place your Answer here]