Statistical Report 4

Payton McCarthy

Professor Davis

MTH 332

5 May, 2020

The Search for the Origin of the Replication of Palindromes within CMV

 

Abstract:  This study searches the DNA of the human cytomegalovirus (CMV), a potentially life-threatening virus, for clusters of complementary palindromes, with the goal of locating the origin of replication.  The data come from M.S. Chee, who published the DNA sequence of CMV in 1990, and from M.Y. Leung, who implemented search algorithms in a computer program to screen the DNA sequence for many types of patterns.

 

Intro and Background:  DNA is like a coded message written in a four-letter alphabet: A, C, T, and G.  The order of these four letters is important, as patterns or repetitions within the DNA can be significant, such as marking where the origin of replication lies.  One type of pattern of interest here is the complementary palindrome.  The letter pairs A-T and C-G are complementary to each other, and a complementary palindrome is a sequence of letters whose reverse is the complement of the forward sequence.  AT and TA are both (very short) complementary palindromes, and CATG, GTAC, ACGT, and TGCA are complementary palindromes as well.  In this study, 296 palindrome sequences at least 10 letters long were found, the longest being 18 letters long.  The entire CMV DNA molecule contains 229,354 complementary pairs of letters, or base pairs, so there is an enormous amount of sequence to search.

 

Methods:  I remember having trouble getting started with this assignment and going to your office for help.  We figured it out eventually, but afterwards came Spring Break and then the nationwide quarantine.  It has been a struggle to keep up with everything on my plate since these drastic changes, but I will share what I was able to record.

To read in the downloaded data, I used the code

data <- read.table("hcmv.txt", header = FALSE, sep = ",")

data <- as.data.frame(t(data))

to label the dataset as ‘data’.  I then defined a variable L that counts the number of palindromes falling in each 4,000-base-pair interval,

L <- c()

k <- 0

while (k <= 56) {

  # count how many palindrome locations fall in the k-th 4,000-base-pair window

  L <- c(L, sum(data >= k * 4000 + 1 & data <= (k + 1) * 4000))

  k <- k + 1

}

I then followed this by defining ‘counts’ and ‘lambda’ as,

counts <- as.data.frame(table(L))

lambda <- (294/57)
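As a quick consistency check (an addition on my part, not in the original write-up), the rate could also be estimated directly from the interval counts in L, assuming L was built with the loop above:

lambda_check <- sum(L) / length(L)  # average palindrome count per 4,000-base-pair interval; should come out close to 294/57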

Now I defined a function for the cumulative Poisson probability P(X <= j) with rate lambda:

PoissonDistribution <- function(j) {

  # cumulative Poisson probability P(X <= j) for the rate lambda defined above

  i <- 0

  PDvalues <- numeric(0)

  while (i <= j) {

    PDvalue <- (lambda ^ i / factorial(i)) * exp(-lambda)

    PDvalues <- c(PDvalues, PDvalue)

    i <- i + 1

  }

  return(sum(PDvalues))

}
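As a sanity check on the function (again my own addition), the same cumulative probability is available from R's built-in ppois():

PoissonDistribution(5)     # hand-rolled cumulative probability P(X <= 5)

ppois(5, lambda = lambda)  # built-in equivalent; the two values should agree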

Unfortunately, this was all I was able to put together.
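For completeness, here is a hedged sketch (not part of the submitted work) of how the observed interval counts might have been set against the Poisson expectations, assuming the variables defined above:

observed <- counts$Freq                             # how many intervals show each palindrome count

count_values <- as.numeric(as.character(counts$L))  # the distinct counts themselves

expected <- dpois(count_values, lambda) * length(L) # expected number of intervals under the Poisson model

round(cbind(count = count_values, observed, expected), 2)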

Discussion and Conclusion:  Since I was not able to fully complete this assignment, I do not have many conclusive results to share.

Statistical Report 3


Payton McCarthy

Professor Davis

MTH 332

21 February, 2020

 

Analysis of Carapace Sizes Before and After the Molting of Female Dungeness Crabs

 

Abstract:  This study compares measurements of carapace (shell) sizes from female Dungeness crabs.  The data were collected after 1983 on the Pacific coast of North America by the California Department of Fish and Game and various commercial crab fishers.  This study aims to examine the relationship between premolt and postmolt carapace sizes and to summarize the results both numerically and graphically.

 

Intro and Background:   The Dungeness Crab is one of the largest and most abundant crabs on the Pacific coast.  When it grows, the crab molts periodically, casting off its shell and growing a new one.  It is reportedly easy to tell if a crab has molted recently by how clean its shell is. All measurements are in millimeters.

 

Methods:  This study uses the computer program RStudio to make statistical calculations about the dataset.  After downloading the data ‘crabs.data’ from the course website, use this code

crabsizes <- read.table("crabs.data", header = TRUE, sep = "")

to label the dataset as ‘crabsizes’.  
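One note I will add (it is not in the original code): the filter() and pull() calls used in this and the later reports come from the dplyr package, and the distribution plots use ggplot2, so both libraries need to be attached first.  In pull(), a negative var selects a column counting from the right-hand side of the table.

library(dplyr)    # provides filter() and pull()

library(ggplot2)  # provides ggplot() and stat_function()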

Then I separated the data into two different lists, one of premolt sizes and the other of postmolt sizes.

pre_crab_list <- pull(crabsizes, var = -5)

post_crab_list <- pull(crabsizes, var = -4)

I then calculated all of the descriptive statistics for both lists of values above by the following,

mean_pre_crab_list <- mean(pre_crab_list)

median_pre_crab_list <- median(pre_crab_list)

SD_pre_crab_list <- sd(pre_crab_list)

IQR_pre_crab_list <- IQR(pre_crab_list)

min_pre_crab_list <- min(pre_crab_list)

max_pre_crab_list <- max(pre_crab_list)

 

mean_post_crab_list <- mean(post_crab_list)

median_post_crab_list <- median(post_crab_list)

SD_post_crab_list <- sd(post_crab_list)

IQR_post_crab_list <- IQR(post_crab_list)

min_post_crab_list <- min(post_crab_list)

max_post_crab_list <- max(post_crab_list)
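As an aside (an addition, not in the original), base R's summary() reports several of these quantities at once and can serve as a quick check on the values above:

summary(pre_crab_list)   # min, quartiles, median, mean, and max in one call

summary(post_crab_list)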

 

z_pre_crab_list <- (pre_crab_list - mean(pre_crab_list))/sd(pre_crab_list)

z_post_crab_list <- (post_crab_list - mean(post_crab_list))/sd(post_crab_list)

 

skewness_pre_crab_list <- sum( ((pre_crab_list - mean(pre_crab_list))^3) / (length(pre_crab_list)*(sd(pre_crab_list) ^ 3)) )

skewness_post_crab_list <- sum( ((post_crab_list - mean(post_crab_list))^3) / (length(post_crab_list)*(sd(post_crab_list) ^ 3)) )

 

kurtosis_pre_crab_list <- sum( ((pre_crab_list - mean(pre_crab_list))^4) / (length(pre_crab_list)*(sd(pre_crab_list) ^ 4)) )

kurtosis_post_crab_list <- sum( ((post_crab_list - mean(post_crab_list))^4) / (length(post_crab_list)*(sd(post_crab_list) ^ 4)) )
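Another optional cross-check (my addition, assuming the moments package is installed): moments::skewness() and moments::kurtosis() compute the same shape statistics, though their values may differ slightly from the formulas above because that package uses the population standard deviation rather than R's sd().

library(moments)

skewness(pre_crab_list)  # compare with skewness_pre_crab_list

kurtosis(pre_crab_list)  # compare with kurtosis_pre_crab_list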

I then created a Normal Distribution graph for these two lists with,

ggplot(data = data.frame(pre_crab_list = c(30, 200)),

       mapping = aes(x = pre_crab_list)) +

  stat_function(mapping = aes(colour = "Premolted Carapace Sizes"),

                fun = dnorm,

                args = list(mean = mean(pre_crab_list),

                            sd = sd(pre_crab_list))) +

  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes"),

                fun = dnorm,

                args = list(mean = mean(post_crab_list),

                            sd = sd(post_crab_list))) +

  scale_colour_manual(values = c("blue", "red")) +

  labs(x = "Carapace Sizes",

       y = "Probabilities",

       title = "Normal Distributions of Premolt Sizes v Postmolt Sizes")

followed by two histogram plots for both lists as well.

hist(pre_crab_list, 

     main="Histogram of Premolt Sizes",

     xlab="Carapace Sizes",

     border="black",

     col="darkorchid1",

     xlim=c(30,160),

     las=1, 

     breaks=100)

 

hist(post_crab_list, 

     main="Histogram of Postmolt Sizes",

     xlab="Carapace Sizes",

     border="black",

     col="darkslateblue",

     xlim=c(30,180),

     las=1, 

     breaks=100)

 

These were simply the full lists of premolt and postmolt carapace sizes; the study now focuses on whether the crabs molted in the field or in the lab and on any statistical differences between the two.  This was done by splitting the original data table in two: one table of crabs that molted in the field and another of crabs that molted in the lab.

 

field_crabs <- filter(crabsizes, lf == 0)

lab_crabs <- filter(crabsizes, lf == 1)

These two tables were each stripped down to two lists: premolt and postmolt sizes in the field, and premolt and postmolt sizes in the lab.

pre_field_crabs <- pull(field_crabs, var = -5)

post_field_crabs <- pull(field_crabs, var = -4)

pre_lab_crabs <- pull(lab_crabs, var = -5)

post_lab_crabs <- pull(lab_crabs, var = -4)

Now all the descriptive statistics for each of these four lists can be found fairly simply by:

mean_pre_field_crabs <- mean(pre_field_crabs)

median_pre_field_crabs <- median(pre_field_crabs)

SD_pre_field_crabs <- sd(pre_field_crabs)

IQR_pre_field_crabs <- IQR(pre_field_crabs)

min_pre_field_crabs <- min(pre_field_crabs)

max_pre_field_crabs <- max(pre_field_crabs)

 

mean_post_field_crabs <- mean(post_field_crabs)

median_post_field_crabs <- median(post_field_crabs)

SD_post_field_crabs <- sd(post_field_crabs)

IQR_post_field_crabs <- IQR(post_field_crabs)

min_post_field_crabs <- min(post_field_crabs)

max_post_field_crabs <- max(post_field_crabs)

 

mean_pre_lab_crabs <- mean(pre_lab_crabs)

median_pre_lab_crabs <- median(pre_lab_crabs)

SD_pre_lab_crabs <- sd(pre_lab_crabs)

IQR_pre_lab_crabs <- IQR(pre_lab_crabs)

min_pre_lab_crabs <- min(pre_lab_crabs)

max_pre_lab_crabs <- max(pre_lab_crabs)

 

mean_post_lab_crabs <- mean(post_lab_crabs)

median_post_lab_crabs <- median(post_lab_crabs)

SD_post_lab_crabs <- sd(post_lab_crabs)

IQR_post_lab_crabs <- IQR(post_lab_crabs)

min_post_lab_crabs <- min(post_lab_crabs)

max_post_lab_crabs <- max(post_lab_crabs)

 

z_pre_field_crabs <- (pre_field_crabs - mean(pre_field_crabs))/sd(pre_field_crabs)

z_post_field_crabs <- (post_field_crabs - mean(post_field_crabs))/sd(post_field_crabs)

z_pre_lab_crabs <- (pre_lab_crabs - mean(pre_lab_crabs))/sd(pre_lab_crabs)

z_post_lab_crabs <- (post_lab_crabs - mean(post_lab_crabs))/sd(post_lab_crabs)

 

skewness_pre_field_crabs <- sum( ((pre_field_crabs - mean(pre_field_crabs))^3) / (length(pre_field_crabs)*(sd(pre_field_crabs) ^ 3)) )

skewness_post_field_crabs <- sum( ((post_field_crabs - mean(post_field_crabs))^3) / (length(post_field_crabs)*(sd(post_field_crabs) ^ 3)) )

skewness_pre_lab_crabs <- sum( ((pre_lab_crabs - mean(pre_lab_crabs))^3) / (length(pre_lab_crabs)*(sd(pre_lab_crabs) ^ 3)) )

skewness_post_lab_crabs <- sum( ((post_lab_crabs - mean(post_lab_crabs))^3) / (length(post_lab_crabs)*(sd(post_lab_crabs) ^ 3)) )

 

kurtosis_pre_field_crabs <- sum( ((pre_field_crabs - mean(pre_field_crabs))^4) / (length(pre_field_crabs)*(sd(pre_field_crabs) ^ 4)) )

kurtosis_post_field_crabs <- sum( ((post_field_crabs - mean(post_field_crabs))^4) / (length(post_field_crabs)*(sd(post_field_crabs) ^ 4)) )

kurtosis_pre_lab_crabs <- sum( ((pre_lab_crabs - mean(pre_lab_crabs))^4) / (length(pre_lab_crabs)*(sd(pre_lab_crabs) ^ 4)) )

kurtosis_post_lab_crabs <- sum( ((post_lab_crabs - mean(post_lab_crabs))^4) / (length(post_lab_crabs)*(sd(post_lab_crabs) ^ 4)) )

The next thing performed was to display all this important statistical information in a visual way that would aid in understanding the results.  The first way this is done is by looking at the Normal Curves of all four lists to compare them easily:

ggplot(data = data.frame(pre_field_crabs = c(75, 200)),

       mapping = aes(x = pre_field_crabs)) +

  stat_function(mapping = aes(colour = "Premolted Carapace Sizes in the Field"),

                fun = dnorm,

                args = list(mean = mean(pre_field_crabs),

                            sd = sd(pre_field_crabs))) +

  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes in the Field"),

                fun = dnorm,

                args = list(mean = mean(post_field_crabs),

                            sd = sd(post_field_crabs))) +

  stat_function(mapping = aes(colour = "Premolted Carapace Sizes in the Lab"),

                fun = dnorm,

                args = list(mean = mean(pre_lab_crabs),

                            sd = sd(pre_lab_crabs))) +

  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes in the Lab"),

                fun = dnorm,

                args = list(mean = mean(post_lab_crabs),

                            sd = sd(post_lab_crabs))) +

  scale_colour_manual(values = c("darkgreen", "red", "chartreuse3", "darkgoldenrod1")) +

  labs(x = "Carapace Sizes",

       y = "Probabilities",

       title = "Normal Distributions of (Premolt Sizes v Postmolt Sizes) v (In the Field v In the Lab)")

Next I looked to create separate histograms for each of these lists:

hist(pre_field_crabs, 

       main="Histogram of Premolt Sizes in the Field",

       xlab="Carapace Sizes",

       border="black",

       col="chartreuse",

       xlim=c(110,160),

       las=1, 

       breaks=100)

 

hist(pre_lab_crabs, 

       main="Histogram of Premolt Sizes in the Lab",

       xlab="Carapace Sizes",

       border="black",

       col="darkgoldenrod1",

       xlim=c(30,160),

       las=1, 

       breaks=100)

 

hist(post_field_crabs, 

     main="Histogram of Postmolt Sizes in the Field",

     xlab="Carapace Sizes",

     border="black",

     col="darkgreen",

     xlim=c(120,170),

     las=1, 

     breaks=100)

 

hist(post_lab_crabs, 

     main="Histogram of Postmolt Sizes in the Lab",

     xlab="Carapace Sizes",

     border="black",

     col="red",

     xlim=c(30,170),

     las=1, 

     breaks=100)

 

Now a procedure is developed for predicting a crab’s premolt size from its postmolt size by deriving an explicit expression.  First, set y and x equal to the means of the full premolt and postmolt lists, respectively.

y <- mean_pre_crab_list

x <- mean_post_crab_list

Then the correlation coefficient (r) can be calculated, which gives the slope (b_hat) of the regression line; from b_hat the intercept term (a_hat) follows, and finally the residual standard deviation (SD_r).

r <- (1/length(post_crab_list)) * sum(((post_crab_list - x) / SD_post_crab_list) * ((pre_crab_list - y) / SD_pre_crab_list))

b_hat <- r * (SD_pre_crab_list / SD_post_crab_list)

a_hat <- b_hat * x - y

SD_r <- sqrt(1 - r^2) * sd(pre_crab_list)

The regression line prediction is thus:   yi = b_hat * xi - a_hat (with a_hat defined as above).
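As a cross-check (again my own addition, not the original analysis), base R's lm() fits the same least-squares line.  The postmolt column is named postsz, as used in the filters below; the premolt column name presz is my assumption and may need to be adjusted to match the actual header in crabs.data.

fit <- lm(presz ~ postsz, data = crabsizes)  # 'presz' is an assumed column name

coef(fit)  # slope should roughly match b_hat; intercept should match y - b_hat * x, i.e. -a_hat as defined above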

To test whether this equation is accurate, it was applied to three different groups of postmolt sizes: the first group between 147.5 mm and 152.5 mm, the second between 142.5 mm and 147.5 mm, and the third between 152.5 mm and 157.5 mm.  The first block of code filters the rows into three tables according to the ranges above.  The second and third blocks of code pull out the columns containing the postmolt and premolt sizes from those three tables, respectively.

group1_crabsizes <- filter(crabsizes, postsz >= 147.5, postsz <= 152.5)

group2_crabsizes <- filter(crabsizes, postsz >= 142.5, postsz <= 147.5)

group3_crabsizes <- filter(crabsizes, postsz >= 152.5, postsz <= 157.5)

 

group1_postmolt <- pull(group1_crabsizes, var = -4)

group2_postmolt <- pull(group2_crabsizes, var = -4)

group3_postmolt <- pull(group3_crabsizes, var = -4)

 

group1_premolt <- pull(group1_crabsizes, var = -5)

group2_premolt <- pull(group2_crabsizes, var = -5)

group3_premolt <- pull(group3_crabsizes, var = -5)

Now to write out the calculations for what each expected premolt size should be (y_hat#), and compare that to what it actually is (group#_premolt).

y_hat1 <- b_hat * group1_postmolt - a_hat

y_hat2 <- b_hat * group2_postmolt - a_hat

y_hat3 <- b_hat * group3_postmolt - a_hat

Compare now by finding each test statistic value:

test_statistic1 <- sum((y_hat1 - mean(group1_premolt))^2) / mean(group1_premolt)

test_statistic2 <- sum((y_hat2 - mean(group2_premolt))^2) / mean(group2_premolt)

test_statistic3 <- sum((y_hat3 - mean(group3_premolt))^2) / mean(group3_premolt)

 

Results:  Here is a table sharing the values of the relevant descriptive statistics described above.  It has been color-coded from green to red (green, green-yellow, yellow, orange, and red): a green number is the largest in its row (descriptive statistic), a red number is the smallest, and the other colors fall in between.

Table:

Descriptive Statistic | Premolt Sizes | Postmolt Sizes | Premolt Sizes (Field) | Postmolt Sizes (Field) | Premolt Sizes (Lab) | Postmolt Sizes (Lab)
Mean | 129.21186 | 143.89767 | 139.00901 | 152.96396 | 126.19945 | 141.10997
Median | 132.8 | 154 | 140.1 | 154 | 128.9 | 143.7
Standard Deviation | 15.86452 | 14.64060 | 7.25115 | 6.71997 | 16.56878 | 15.28078
IQR | 18.325 | 15.45 | 7.70000 | 7 | 18.1 | 15.6
Minimum | 31.1 | 38.8 | 113.6 | 127.7 | 31.1 | 38.8
Maximum | 155.1 | 166.8 | 153.9 | 166.5 | 155.1 | 166.8
Z-3 | 81.6183 | 99.97587 | 117.25556 | 132.80405 | 76.49311 | 95.26763
Z-2 | 97.48282 | 114.61647 | 124.50671 | 139.52402 | 93.06189 | 110.54841
Z-1 | 113.34734 | 129.25707 | 131.75786 | 146.24399 | 109.63067 | 125.82919
Z0 | 129.21186 | 143.89767 | 139.00901 | 152.96396 | 126.19945 | 141.10997
Z1 | 145.07638 | 158.53827 | 146.26016 | 159.68393 | 142.76823 | 156.39075
Z2 | 160.9409 | 173.17887 | 153.51131 | 166.4039 | 159.33701 | 171.67153
Z3 | 176.80542 | 187.81927 | 160.76246 | 173.12387 | 175.90579 | 186.95231
Skewness | -1.99712 | -2.33945 | -1.09590 | -1.10398 | -1.88109 | -2.27861
Kurtosis | 9.72498 | 13.06052 | 4.67604 | 5.14670 | 8.97447 | 12.37337

 

A comparative look at the Normal Distributions of all premolt carapace sizes against all postmolt carapace sizes helps to paint a clear visual picture.  It can be seen that postmolt carapaces tend to be larger than premolt carapaces, which is to be expected.

It is also helpful to compare the histograms of these two distributions.

The histograms tell the same story.  The study now turns to the visual results of whether carapaces grew relatively larger in the field or in the lab: first the Normal Distributions, then each of the four histograms.

It can be clearly seen that carapace sizes measured in the field have a much higher probability of being larger (both premolt and postmolt) than those of crabs observed to molt in the lab.

The expression for predicting a crab’s premolt size from its postmolt size that was derived in the methods section is now brought forth here: 

yi = b_hat * xi - a_hat

From the calculations made in the program, b_hat = 1.07 and a_hat = 24.89.  Therefore, my regression line prediction expression is yi = 1.07 * xi - 24.89.

For any postmolt size (xi) that is given, multiply it by 1.07 and then subtract 24.89, and you will receive a fairly accurate prediction of that crab’s premolt size (yi).
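For example, a crab with a postmolt carapace of 150 mm would be predicted to have had a premolt carapace of about 1.07 * 150 - 24.89 ≈ 135.6 mm.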

The residual SD (SD_r) was calculated to be about 2.42 mm.  This means that, when using the regression prediction equation above, about 68% of observed premolt sizes will fall within about 2.42 mm above or below their predictions, and about 95% will fall within about 4.85 mm.  Finally, the values of the test statistics calculated between the expected and observed premolt sizes are:

test_statistic1 = 2.0489

test_statistic2 = 1.437982

test_statistic3 = 1.608636

Discussion and Conclusion:  The values of the test statistics are relatively low, suggesting that the observed and expected values are close and that the fit is appropriate.  Therefore, this study accomplished what it set out to do: it summarized the data numerically and graphically and provided an expression that predicts, reasonably accurately, a crab’s premolt carapace size from its postmolt carapace size alone.

Statistical Report 2


 

Payton McCarthy

Professor Davis

MTH 332

13 February, 2020

 

Analysis of Who Plays Video Games

 

Abstract:  This study examines data from a random survey of students at the University of California, Berkeley.  These students were enrolled in Statistics 2 during the Fall of 1994.  The survey looks at the students’ reported frequency of video game play, whether they like to play video games and why, and a collection of general information about the students.

 

Intro and Background:   An exam was given the week before the survey was taken.  314 students took the exam; 95 of them were chosen at random for the survey, and 91 responded.  The objective of this study is to explore the survey responses with the intent of providing useful information to others.

 

Methods:  This study uses the computer program RStudio to make statistical calculations about the dataset.  After downloading the data ‘video.data’ from the course website, use this code

data <- read.table("video.data", header = TRUE, sep = "")

to label the dataset as ‘data’.  Then I quickly labeled all the constants given to us,

N <- 314

n <- 91

nonrespondents <- 4
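(A small aside I will add: the 91 respondents plus the 4 nonrespondents account for the 95 students originally selected for the survey.)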

Then I separated the data based on the first column, "time", which is simply the number of hours played in the week prior to the survey.  The data were split into two lists: the time-values of the students who did play during that week, and the time-values of all students.  This was done so that statistical calculations can be made about each of these two groups separately.  Do this by using

stu_DidPlay <- filter(data, time != 0)

stu_DidPlay_values <- pull(stu_DidPlay, var = -15)

 

values_time <- pull(data, var = -15)

The fraction of students who did play can be calculated by

fraction_DidPlay <- length(stu_DidPlay_values) / n

Now all the descriptive statistics can be computed directly:

x_DidPlay <- mean(stu_DidPlay_values)

x_time <- mean(values_time)

 

Median_DidPlay  <- median(stu_DidPlay_values)

Median_time <- median(values_time)

 

SD_DidPlay <- sd(stu_DidPlay_values)

SD_time <- sd(values_time)

 

IQR_DidPlay  <- IQR(stu_DidPlay_values)

IQR_time <- IQR(values_time)

 

min_DidPlay  <- min(stu_DidPlay_values)

min_time <- min(values_time)

max_DidPlay  <- max(stu_DidPlay_values)

max_time <- max(values_time)

Next is to find the z-scores, skewness, and kurtosis for both groups.  This is done by calculating the z-score for each value in both lists of students who did play and all students:

z_DidPlay <- (stu_DidPlay_values - mean(stu_DidPlay_values))/sd(stu_DidPlay_values)

z_time <- (values_time - mean(values_time))/sd(values_time)

These new lists are comprised of all the z-scores for all the time-values for both groups.  Now the skewness and kurtosis can be calculated by:

skewness_DidPlay <- sum(z_DidPlay ^ 3) / (length(z_DidPlay) * (sd(stu_DidPlay_values)) ^ 3)

skewness_time <- sum(z_time ^ 3) / (length(z_time) * (sd(values_time)) ^ 3)

kurtosis_DidPlay <- sum(z_DidPlay ^ 4) / (length(z_DidPlay) * (sd(stu_DidPlay_values)) ^ 4)

kurtosis_time <- sum(z_time ^ 4) / (length(z_time) * (sd(values_time)) ^ 4)

We now want to calculate a 95% confidence interval for this data.  I did so for both datasets:

lower_2SDinterval_DidPlay <- (x_DidPlay - (2 * SD_DidPlay)) / sqrt(n)

upper_2SDinterval_DidPlay <- (x_DidPlay + (2 * SD_DidPlay)) / sqrt(n)

 

lower_2SDinterval_time <- (x_time - (2 * SD_time)) / sqrt(n)

upper_2SDinterval_time <- (x_time + (2 * SD_time)) / sqrt(n)

 

The next thing performed was to display all the important statistical information in a visual way that would further aid in understanding the results.  The first way this is done is by plotting the Normal Curves of both datasets so they can be compared easily, using this code:

ggplot(data = data.frame(stu_DidPlay_values = c(0, 30)),

       mapping = aes(x = stu_DidPlay_values)) +

    stat_function(mapping = aes(colour = "Students Who Did Play"),

                  fun = dnorm,

                  args = list(mean = mean(stu_DidPlay_values),

                              sd = sd(stu_DidPlay_values))) +

    stat_function(mapping = aes(colour = "All Students"),

                  fun = dnorm,

                  args = list(mean = mean(values_time),

                              sd = sd(values_time))) +

    scale_colour_manual(values = c("blue", "red")) +

    labs(x = "Hours Played Last Week",

         y = "Probabilities",

         title = "Normal Curves for Hours Played Last Week Between Groups")

 

Next I looked at the second column in ‘video.data’, labeled "like".  This data tells us how much the student likes to play: 1 = never played; 2 = very much; 3 = somewhat; 4 = not really; 5 = not at all.  It was important to filter out the rows where a 99 was recorded, as these would skew the data:

stu_like <- filter(data, like != 99)

stu_like_values <- pull(stu_like, var = -14)

The descriptive stats for this dataset were found by,

x_stu_like_values <- mean(stu_like_values)

Median_stu_like_values <- median(stu_like_values)

SD_stu_like_values <- sd(stu_like_values)

IQR_stu_like_values <- IQR(stu_like_values)

min_stu_like_values <- min(stu_like_values)

max_stu_like_values <- max(stu_like_values)

 

z_stu_like_values <- (stu_like_values - mean(stu_like_values))/sd(stu_like_values)

 

skewness_stu_like_values <- sum(z_stu_like_values ^ 3) / (length(z_stu_like_values) * (sd(stu_like_values)) ^ 3)

kurtosis_stu_like_values <- sum(z_stu_like_values ^ 4) / (length(z_stu_like_values) * (sd(stu_like_values)) ^ 4)

 

Then a second figure was made: another Normal Distribution graph that shows the distribution of how much students like to play:

ggplot(data = data.frame(stu_like_values = c(1, 5)),

       mapping = aes(x = stu_like_values)) +

    stat_function(mapping = aes(colour = "Who Likes to Play"),

                  fun = dnorm,

                  args = list(mean = mean(stu_like_values),

                              sd = sd(stu_like_values))) +

    labs(x = "How Much Students Like to Play",

         y = "Probabilities",

         title = "Normal Curve for How Much Students Like to Play")

This kind of graph is beneficial for displaying the mean and standard deviation simply.  The skewness and kurtosis can be clearly seen and are comparable.

 

The next thing done was to look at the frequency of play, that is, how often these students normally play video games: 1 = daily; 2 = weekly; 3 = monthly; 4 = semesterly.

freq_play <- filter(data, freq != 99)

freq_play_values <- pull(freq_play, var = -12)

Descriptive stats,

x_freq_play_values <- mean(freq_play_values)

Median_freq_play_values <- median(freq_play_values)

SD_freq_play_values <- sd(freq_play_values)

IQR_freq_play_values <- IQR(freq_play_values)

min_freq_play_values <- min(freq_play_values)

max_freq_play_values <- max(freq_play_values)

 

z_freq_play_values <- (freq_play_values - mean(freq_play_values)) / sd(freq_play_values)

 

skewness_freq_play_values <- sum(z_freq_play_values ^ 3) / (length(z_freq_play_values) * (sd(freq_play_values)) ^ 3)

kurtosis_freq_play_values <- sum(z_freq_play_values ^ 4) / (length(z_freq_play_values) * (sd(freq_play_values)) ^ 4)

 

Finally, a Normal Distribution was made for the students’ reported frequency of play,

ggplot(data = data.frame(freq_play_values = c(1, 4)),

       mapping = aes(x = freq_play_values)) +

    stat_function(mapping = aes(colour = "How Often Will Play"),

                  fun = dnorm,

                  args = list(mean = mean(freq_play_values),

                              sd = sd(freq_play_values))) +

    labs(x = "How Often Students Will Play",

         y = "Probabilities",

         title = "Normal Curve for Frequency of Play")

 

Results:  Here is a table sharing all of the values of the relevant descriptive statistics described above.

Table:

Descriptive Statistic | Hours Played by Students Who Did Play | Hours Played by All Students | If Students Like to Play (1-5) | Reported Frequency of Play (1-4)
Mean | 3.32647 | 1.24286 | 3.02222 | 2.70513
Median | 2 | 0 | 3 | 3
Standard Deviation | 5.63616 | 3.77704 | 0.87381 | 1.02068
IQR | 1 | 1.25 | 1 | 2
Minimum | 0.1 | 0 | 1 | 1
Maximum | 30 | 30 | 5 | 4
Z-3 | | | 0.40079 |
Z-2 | | | 1.2746 | 0.66377
Z-1 | | | 2.14841 | 1.68445
Z0 | 3.32647 | 1.24286 | 3.02222 | 2.70513
Z1 | 8.96263 | 5.0199 | 3.89603 | 3.72581
Z2 | 14.59879 | 8.79694 | 4.76984 |
Z3 | 20.23495 | 12.57398 | |
Skewness | 0.01942 | 0.10566 | 0.83539 | 0.11051
Kurtosis | 0.01539 | 0.195606 | 5.25736 | 1.75250

 

The fraction of students who played video games the week before the survey is about 37.4%.
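(In other words, roughly 34 of the 91 respondents reported a nonzero number of hours, since 34/91 ≈ 0.374.)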

The estimated 95% confidence interval for the amount of hours spent by students who did play the week before is (-0.833, 1.530), and for the amount of hours spent by all students in general is (-0.662, 0.922).  A look at the Normal Distributions for these confidence intervals helps to paint a clearer picture:

It is impossible for students to have a negative amount of hours spent playing video games, which is visually shown by the graph above.  Therefore, the confidence intervals are better stated to be (0, 1.530) and (0, 0.922), respectively.

We can compare this with the reported frequency of play and see that students lean towards 3 more than 2; that is, they play closer to monthly than weekly.  According to the table, students have an average frequency-of-play value of about 2.7, which corresponds to a sample average of 1.25 hours of gaming in the week before the survey across all students, and an average of 3.3 hours across the students who actually played.

If we look at the final Normal Distribution, for how much students actually like to play, it looks far more standard.  This would seem to indicate that the students’ frequency of play may have been affected by the exam given during the week before the survey.

 

Discussion and Conclusion:  In this final part, we can look at some of the “attitude polls” from the students.  By counting, we can see that 23/90 of the reporting students ‘very much’ like to play video games, 46/90 ‘somewhat’ like to play, and 21/90 are ‘not really’ or ‘not at all’ into playing video games.

The reasons for this vary.  Some 72 participants gave reasons why they do like to play video games, with most reporting that they play to relax (66/72 = 91.7%).  Some also reported secondary or even tertiary reasons: 38.9% like the ‘feeling of mastery’, 37.5% play because they are ‘bored’, 36.1% play for the ‘graphics/realism’ that video games offer, and 33.3% like the ‘mental challenge’ that video games can present.

Some 83 participants also reported why they do not like to play video games.  The two most common reasons were that it takes up ‘too much time’ (57.8%) and that it ‘costs too much’ (48.2%), followed by feelings that it is ‘pointless’ (39.8%) and ‘frustrating’ (31.3%).

Statistical Report 1


 

Payton McCarthy

Professor Davis

MTH 332

7 February, 2020

Does Smoking Affect a Child’s Birth Weight?

 

Abstract:  This study compares data on the birth weights of babies from mothers who smoke and from mothers who do not.  The data are from the Child Health and Development Studies (CHDS), which includes all pregnancies from 1960-1967 among women in the Kaiser Foundation Health Plan in Oakland, California.  There is a claim that smoking is responsible for a 5.3 to 8.8 ounce reduction in birth weight, and that smoking mothers are about twice as likely as nonsmoking mothers to have a low birth-weight baby (under 88.2 ounces).

 

Intro and Background:  Many health care practitioners and health organizations have given strong warnings to pregnant mothers who smoke.  However, not everyone has taken these warnings seriously; for example, it was found in 1996 that 15% of pregnant women smoked through their pregnancy (National Center for Health Statistics).  This study looks at whether there is a difference between the birth weights of babies born to smoking mothers and nonsmoking mothers, and if so, whether the difference is significant.

 

Methods:  When this data was recorded, the measurements of each of the 1236 babies were taken at birth.  The weights were collected and measured in ounces. The smoking status of the mother was either tallied as a 1 or a 0 to denote if they were smoking or nonsmoking respectively.  This study uses the computer program RStudio to make statistical calculations about the dataset.  After downloading the data ‘babiesI.data’ from the course website, use this code

data <- read.table("babiesI.data", header = TRUE, sep = "")

to label the dataset as ‘data’.  Then you want to be able to separate the data into two different datasets of smokers and non smokers so that you can make statistical calculations about each of these two separate groups.  Do this by using 

smokers <- filter(data, smoke == 1)

smokers <- pull(smokers, var = -2)

to create a list of birth weights of babies born to smoking mothers referred to as ‘smokers’.  Do the same for the birth weights of babies born to non smoking mothers and refer to this as ‘nonsmokers’.

nonsmokers <- filter(data, smoke == 0)

nonsmokers <- pull(nonsmokers, var = -2)

Now all the descriptive statistics can be computed directly:

mean(smokers)

median(smokers)

sd(smokers)

IQR(smokers)

min(smokers)

max(smokers)

then,

mean(nonsmokers)

median(nonsmokers)

sd(nonsmokers)

IQR(nonsmokers)

min(nonsmokers)

max(nonsmokers)

Next is to find the z-scores, skewness, and kurtosis for both groups.  This is done by calculating the z-score for each value in both lists of smokers and non smokers:

z_smokers <- (smokers - mean(smokers))/sd(smokers)

z_nonsmokers <- (nonsmokers - mean(nonsmokers))/sd(nonsmokers)

These new lists are comprised of all the z-scores for all the corresponding birth weights of smokers and non smokers.  Now the skewness and kurtosis can be calculated by:

skewness_smokers <-sum(z_smokers ^ 3) / (length(z_smokers) * (sd(smokers)) ^ 3)

kurtosis_smokers <- sum(z_smokers ^ 4) / (length(z_smokers) * (sd(smokers)) ^ 4)

skewness_nonsmokers <- sum(z_nonsmokers ^ 3) / (length(z_nonsmokers) * (sd(nonsmokers)) ^ 3)

kurtosis_nonsmokers <- sum(z_nonsmokers ^ 4) / (length(z_nonsmokers) * (sd(nonsmokers)) ^ 4)

We now want to calculate the probability that a baby is born underweight (88.2 ounces or below), as mentioned in the abstract, for babies born to mothers who smoked and to mothers who did not.  The first step is to compute the z-score of the 88.2-ounce cutoff in each group.

underweight_smokers <- (88.2 - mean(smokers))/sd(smokers)

underweight_nonsmokers <- (88.2 - mean(nonsmokers))/sd(nonsmokers)
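These are z-scores for the 88.2-ounce cutoff; the underweight probabilities reported in the Results table correspond to the standard normal CDF of these values, which (a small addition on my part) can be obtained with pnorm():

prob_underweight_smokers <- pnorm(underweight_smokers)        # about 0.076, i.e. 7.6%

prob_underweight_nonsmokers <- pnorm(underweight_nonsmokers)  # about 0.023, i.e. 2.3%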

The next thing performed was to display all the important statistical information in a visual way that would further aid in understanding the results.  The first way this is done is with a Box and Whisker plot that compares the ‘smokers’ and ‘nonsmokers’ datasets, using this code:

boxPlot_SvN <- boxplot(nonsmokers, smokers,

                       main = "Child Birthweights for Nonsmoking Mothers and Smoking Mothers",

                       at = c(1, 2), names = c("Non-S", "Smokers"),

                       xlab = "Birthweight in Ounces", horizontal = TRUE, notch = TRUE)

This plot will show the locations, dispersions, and outliers of the birth weights of babies to smoking mothers and non smoking mothers fairly well.  It may also give a sense for the skewness and the tail size. The second figure that is to be shared is a Normal Distribution graph that also compares the same two datasets:

ggplot(data = data.frame(smokers = c(30, 200)),

         mapping = aes(x = smokers)) +

     stat_function(mapping = aes(colour = "Smoking Mothers"),

                  fun = dnorm,

                  args = list(mean = mean(smokers),

                              sd = sd(smokers))) +

     stat_function(mapping = aes(colour = "Nonsmoking Mothers"),

                  fun = dnorm,

                  args = list(mean = mean(nonsmokers),

                              sd = sd(nonsmokers))) +

     scale_colour_manual(values = c("blue", "red")) +

     labs(x = "in Ounces",

         y = "Probabilities",

         title = "Normal Curves for Child Birthweights")

This graph will be beneficial for displaying the mean and standard deviation of both datasets.  The skewness and kurtosis of both datasets will be clearly seen and comparable.

 

Results:  Here is a table sharing all of the values of the relevant descriptive statistics described above.

Table:

Descriptive Statistic | Smoking Mother | Nonsmoking Mother
Mean | 114.1095 | 123.0472
Median | 115 | 123
Standard Deviation | 18.09895 | 17.39869
IQR | 24 | 21
Minimum | 58 | 55
Maximum | 163 | 176
Z-3 | 59.81265 | 70.85113
Z-2 | 77.9116 | 88.24982
Z-1 | 96.01055 | 105.64851
Z0 | 114.1095 | 123.0472
Z1 | 132.20845 | 140.44589
Z2 | 150.3074 | 157.84458
Z3 | 168.40635 | 175.24327
Skewness | -5.6489e-6 | -3.543044e-5
Kurtosis | 2.7732e-5 | 4.393673e-5
Probability of Underweight (under 88.2 ounces) | 7.6136% | 2.2596%

Box and Whisker Plot:

Normal Distribution:

Discussion and Conclusion:  The question this study looked at is whether there is a difference between the birth weights of babies born to smoking mothers and nonsmoking mothers, and if so, whether the difference is significant.  Two claims are being tested: is smoking responsible for a 5.3 to 8.8 ounce reduction in birth weight, and are mothers who smoke about twice as likely as nonsmoking mothers to have a low birth-weight baby (under 88.2 ounces)?

If we look at the data, there is a mean difference of 8.9377 ounces between babies whose mothers did not smoke through their pregnancy and those whose mothers did.  For the given data, the first claim appears to hold, as the observed mean difference sits at (indeed slightly above) the upper end of the claimed 5.3 to 8.8 ounce range.

As to whether mothers who smoke are about twice as likely as nonsmoking mothers to have a low birth-weight baby (under 88.2 ounces), we look at the calculated statistics.  The table shows that babies born to smoking mothers have a 7.6136% chance of being born underweight, compared with 2.2596% for babies born to nonsmoking mothers.  According to the given data, babies of smoking mothers are therefore about 3.37 times as likely to be born underweight (7.6136 / 2.2596 ≈ 3.37), a ratio roughly 70% larger than the factor of two in the original claim.

One factor that would seemingly be important to this study, yet is unavailable, is the age of each pregnant woman.  Certainly a mother’s age has some effect on the child’s health and birth weight, and it would be interesting to see how this extra variable, conditioned on whether the mother smokes, would affect the data and the results.  Nevertheless, with the given data we have affirmed that smoking appears to be responsible for a reduction in birth weight of at least 5.3 to 8.8 ounces, and that smoking mothers are about twice as likely (or more) as nonsmoking mothers to have a baby born underweight (88.2 ounces or less).