Statistical Report 4

Payton McCarthy

Professor Davis

MTH 332

5 May, 2020

The Search for the Origin of the Replication of Palindromes within CMV

 

Abstract:  This study searches the DNA of the human cytomegalovirus (CMV), a potentially life-threatening virus, for clusters of complementary palindromes, with the goal of locating the origin of replication.  The data come from M.S. Chee, who published the DNA sequence of CMV in 1990, and from M.Y. Leung, who implemented search algorithms in a computer program to screen the DNA sequence for many types of patterns.

 

Intro and Background:  DNA is like a coded message written in a four-letter alphabet: A, C, T, and G.  The order of these four letters is important, as patterns or repetitions within the DNA can be significant, such as marking where the origin of replication lies.  One type of pattern of interest here is the complementary palindrome.  The letter pairs A-T and C-G are complementary to each other, and a complementary palindrome is a sequence of letters whose reverse is the complement of the forward sequence.  AT and TA are both (very short) complementary palindromes, and CATG, GTAC, ACGT, and TGCA are complementary palindromes as well.  In this study, 296 palindrome sequences at least 10 letters long were found, the longest being 18 letters long.  The entire CMV DNA molecule contains 229,354 complementary pairs of letters, or base pairs, so there is an enormous amount of sequence to search.

 

Methods:  I remember having trouble getting started with this assignment and going to your office for help.  We figured it out eventually, but afterwards came Spring Break and then the nationwide quarantine.  It has been a struggle to keep up with everything on my plate since these drastic changes, but I will share what I was able to record.

To read in the downloaded data, I used the code

data <- read.table("hcmv.txt", header = FALSE, sep = ",")

data <- as.data.frame(t(data))

to label the dataset as ‘data’.  I then defined a variable L that counts the number of palindromes falling in each 4,000-base-pair interval,

L <- c()

k <- 0

while (k <= 56) {

  # count how many palindrome locations fall in the k-th 4,000-base-pair window

  L <- c(L, sum(data >= k * 4000 + 1 & data <= (k + 1) * 4000))

  k <- k + 1

}

I then followed this by defining ‘counts’ and ‘lambda’ as,

counts <- as.data.frame(table(L))

lambda <- (294/57)
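As a quick consistency check (an addition on my part, not in the original write-up), the rate could also be estimated directly from the interval counts in L, assuming L was built with the loop above:

lambda_check <- sum(L) / length(L)  # average palindrome count per 4,000-base-pair interval; should come out close to 294/57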

Now I defined a function for the cumulative Poisson probability P(X <= j) with rate lambda:

PoissonDistribution <- function(j) {

  # cumulative Poisson probability P(X <= j) for the rate lambda defined above

  i <- 0

  PDvalues <- numeric(0)

  while (i <= j) {

    PDvalue <- (lambda ^ i / factorial(i)) * exp(-lambda)

    PDvalues <- c(PDvalues, PDvalue)

    i <- i + 1

  }

  return(sum(PDvalues))

}
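As a sanity check on the function (again my own addition), the same cumulative probability is available from R's built-in ppois():

PoissonDistribution(5)     # hand-rolled cumulative probability P(X <= 5)

ppois(5, lambda = lambda)  # built-in equivalent; the two values should agree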

Unfortunately, this was all I was able to put together.
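For completeness, here is a hedged sketch (not part of the submitted work) of how the observed interval counts might have been set against the Poisson expectations, assuming the variables defined above:

observed <- counts$Freq                             # how many intervals show each palindrome count

count_values <- as.numeric(as.character(counts$L))  # the distinct counts themselves

expected <- dpois(count_values, lambda) * length(L) # expected number of intervals under the Poisson model

round(cbind(count = count_values, observed, expected), 2)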

Discussion and Conclusion:  Since I was not able to fully complete this assignment, I do not have many conclusive results to share.

Statistical Report 3


Payton McCarthy

Professor Davis

MTH 332

21 February, 2020

 

Analysis of Carapace Sizes Before and After the Molting of Female Dungeness Crabs

 

Abstract:  This study compares measurements of carapace (shell) sizes from female Dungeness crabs.  The data were collected after 1983 on the Pacific coast of North America by the California Department of Fish and Game and various commercial crab fishers.  This study aims to examine the relationship between premolt and postmolt carapace sizes and to summarize the results both numerically and graphically.

 

Intro and Background:   The Dungeness Crab is one of the largest and most abundant crabs on the Pacific coast.  When it grows, the crab molts periodically, casting off its shell and growing a new one.  It is reportedly easy to tell if a crab has molted recently by how clean its shell is. All measurements are in millimeters.

 

Methods:  This study uses the computer program RStudio to make statistical calculations about the dataset.  After downloading the data ‘crabs.data’ from the course website, use this code

crabsizes <- read.table("crabs.data", header = TRUE, sep = "")

to label the dataset as ‘crabsizes’.  
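One note I will add (it is not in the original code): the filter() and pull() calls used in this and the later reports come from the dplyr package, and the distribution plots use ggplot2, so both libraries need to be attached first.  In pull(), a negative var selects a column counting from the right-hand side of the table.

library(dplyr)    # provides filter() and pull()

library(ggplot2)  # provides ggplot() and stat_function()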

Then I separated the data into two different lists, one of premolt sizes and the other of postmolt sizes.

pre_crab_list <- pull(crabsizes, var = -5)

post_crab_list <- pull(crabsizes, var = -4)

I then calculated all of the descriptive statistics for both lists of values above by the following,

mean_pre_crab_list <- mean(pre_crab_list)

median_pre_crab_list <- median(pre_crab_list)

SD_pre_crab_list <- sd(pre_crab_list)

IQR_pre_crab_list <- IQR(pre_crab_list)

min_pre_crab_list <- min(pre_crab_list)

max_pre_crab_list <- max(pre_crab_list)

 

mean_post_crab_list <- mean(post_crab_list)

median_post_crab_list <- median(post_crab_list)

SD_post_crab_list <- sd(post_crab_list)

IQR_post_crab_list <- IQR(post_crab_list)

min_post_crab_list <- min(post_crab_list)

max_post_crab_list <- max(post_crab_list)
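As an aside (an addition, not in the original), base R's summary() reports several of these quantities at once and can serve as a quick check on the values above:

summary(pre_crab_list)   # min, quartiles, median, mean, and max in one call

summary(post_crab_list)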

 

z_pre_crab_list <- (pre_crab_list - mean(pre_crab_list))/sd(pre_crab_list)

z_post_crab_list <- (post_crab_list - mean(post_crab_list))/sd(post_crab_list)

 

skewness_pre_crab_list <- sum( ((pre_crab_list - mean(pre_crab_list))^3) / (length(pre_crab_list)*(sd(pre_crab_list) ^ 3)) )

skewness_post_crab_list <- sum( ((post_crab_list - mean(post_crab_list))^3) / (length(post_crab_list)*(sd(post_crab_list) ^ 3)) )

 

kurtosis_pre_crab_list <- sum( ((pre_crab_list - mean(pre_crab_list))^4) / (length(pre_crab_list)*(sd(pre_crab_list) ^ 4)) )

kurtosis_post_crab_list <- sum( ((post_crab_list - mean(post_crab_list))^4) / (length(post_crab_list)*(sd(post_crab_list) ^ 4)) )
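Another optional cross-check (my addition, assuming the moments package is installed): moments::skewness() and moments::kurtosis() compute the same shape statistics, though their values may differ slightly from the formulas above because that package uses the population standard deviation rather than R's sd().

library(moments)

skewness(pre_crab_list)  # compare with skewness_pre_crab_list

kurtosis(pre_crab_list)  # compare with kurtosis_pre_crab_list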

I then created a Normal Distribution graph for these two lists with,

ggplot(data = data.frame(pre_crab_list = c(30, 200)),

       mapping = aes(x = pre_crab_list)) +

  stat_function(mapping = aes(colour = "Premolted Carapace Sizes"),

                fun = dnorm,

                args = list(mean = mean(pre_crab_list),

                            sd = sd(pre_crab_list))) +

  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes"),

                fun = dnorm,

                args = list(mean = mean(post_crab_list),

                            sd = sd(post_crab_list))) +

  scale_colour_manual(values = c("blue", "red")) +

  labs(x = "Carapace Sizes",

       y = "Probabilities",

       title = "Normal Distributions of Premolt Sizes v Postmolt Sizes")

followed by two histogram plots for both lists as well.

hist(pre_crab_list, 

     main="Histogram of Premolt Sizes",

     xlab="Carapace Sizes",

     border="black",

     col="darkorchid1",

     xlim=c(30,160),

     las=1, 

     breaks=100)

 

hist(post_crab_list, 

     main="Histogram of Postmolt Sizes",

     xlab="Carapace Sizes",

     border="black",

     col="darkslateblue",

     xlim=c(30,180),

     las=1, 

     breaks=100)

 

These were simply the full lists of premolt and postmolt carapace sizes; the study now focuses on whether the crabs molted in the field or in the lab and on any statistical differences between the two.  This was done by splitting the original data table in two: one table of crabs that molted in the field and another of crabs that molted in the lab.

 

field_crabs <- filter(crabsizes, lf == 0)

lab_crabs <- filter(crabsizes, lf == 1)

These two tables were each stripped down to two lists: premolt and postmolt sizes in the field, and premolt and postmolt sizes in the lab.

pre_field_crabs <- pull(field_crabs, var = -5)

post_field_crabs <- pull(field_crabs, var = -4)

pre_lab_crabs <- pull(lab_crabs, var = -5)

post_lab_crabs <- pull(lab_crabs, var = -4)

Now all the descriptive statistics for each of these four lists can be found fairly simply by:

mean_pre_field_crabs <- mean(pre_field_crabs)

median_pre_field_crabs <- median(pre_field_crabs)

SD_pre_field_crabs <- sd(pre_field_crabs)

IQR_pre_field_crabs <- IQR(pre_field_crabs)

min_pre_field_crabs <- min(pre_field_crabs)

max_pre_field_crabs <- max(pre_field_crabs)

 

mean_post_field_crabs <- mean(post_field_crabs)

median_post_field_crabs <- median(post_field_crabs)

SD_post_field_crabs <- sd(post_field_crabs)

IQR_post_field_crabs <- IQR(post_field_crabs)

min_post_field_crabs <- min(post_field_crabs)

max_post_field_crabs <- max(post_field_crabs)

 

mean_pre_lab_crabs <- mean(pre_lab_crabs)

median_pre_lab_crabs <- median(pre_lab_crabs)

SD_pre_lab_crabs <- sd(pre_lab_crabs)

IQR_pre_lab_crabs <- IQR(pre_lab_crabs)

min_pre_lab_crabs <- min(pre_lab_crabs)

max_pre_lab_crabs <- max(pre_lab_crabs)

 

mean_post_lab_crabs <- mean(post_lab_crabs)

median_post_lab_crabs <- median(post_lab_crabs)

SD_post_lab_crabs <- sd(post_lab_crabs)

IQR_post_lab_crabs <- IQR(post_lab_crabs)

min_post_lab_crabs <- min(post_lab_crabs)

max_post_lab_crabs <- max(post_lab_crabs)

 

z_pre_field_crabs <- (pre_field_crabs - mean(pre_field_crabs))/sd(pre_field_crabs)

z_post_field_crabs <- (post_field_crabs - mean(post_field_crabs))/sd(post_field_crabs)

z_pre_lab_crabs <- (pre_lab_crabs - mean(pre_lab_crabs))/sd(pre_lab_crabs)

z_post_lab_crabs <- (post_lab_crabs - mean(post_lab_crabs))/sd(post_lab_crabs)

 

skewness_pre_field_crabs <- sum( ((pre_field_crabs - mean(pre_field_crabs))^3) / (length(pre_field_crabs)*(sd(pre_field_crabs) ^ 3)) )

skewness_post_field_crabs <- sum( ((post_field_crabs - mean(post_field_crabs))^3) / (length(post_field_crabs)*(sd(post_field_crabs) ^ 3)) )

skewness_pre_lab_crabs <- sum( ((pre_lab_crabs - mean(pre_lab_crabs))^3) / (length(pre_lab_crabs)*(sd(pre_lab_crabs) ^ 3)) )

skewness_post_lab_crabs <- sum( ((post_lab_crabs - mean(post_lab_crabs))^3) / (length(post_lab_crabs)*(sd(post_lab_crabs) ^ 3)) )

 

kurtosis_pre_field_crabs <- sum( ((pre_field_crabs - mean(pre_field_crabs))^4) / (length(pre_field_crabs)*(sd(pre_field_crabs) ^ 4)) )

kurtosis_post_field_crabs <- sum( ((post_field_crabs - mean(post_field_crabs))^4) / (length(post_field_crabs)*(sd(post_field_crabs) ^ 4)) )

kurtosis_pre_lab_crabs <- sum( ((pre_lab_crabs - mean(pre_lab_crabs))^4) / (length(pre_lab_crabs)*(sd(pre_lab_crabs) ^ 4)) )

kurtosis_post_lab_crabs <- sum( ((post_lab_crabs - mean(post_lab_crabs))^4) / (length(post_lab_crabs)*(sd(post_lab_crabs) ^ 4)) )

The next thing performed was to display all this important statistical information in a visual way that would aid in understanding the results.  The first way this is done is by looking at the Normal Curves of all four lists to compare them easily:

ggplot(data = data.frame(pre_field_crabs = c(75, 200)),

       mapping = aes(x = pre_field_crabs)) +

  stat_function(mapping = aes(colour = "Premolted Carapace Sizes in the Field"),

                fun = dnorm,

                args = list(mean = mean(pre_field_crabs),

                            sd = sd(pre_field_crabs))) +

  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes in the Field"),

                fun = dnorm,

                args = list(mean = mean(post_field_crabs),

                            sd = sd(post_field_crabs))) +

  stat_function(mapping = aes(colour = "Premolted Carapace Sizes in the Lab"),

                fun = dnorm,

                args = list(mean = mean(pre_lab_crabs),

                            sd = sd(pre_lab_crabs))) +

  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes in the Lab"),

                fun = dnorm,

                args = list(mean = mean(post_lab_crabs),

                            sd = sd(post_lab_crabs))) +

  scale_colour_manual(values = c("darkgreen", "red", "chartreuse3", "darkgoldenrod1")) +

  labs(x = "Carapace Sizes",

       y = "Probabilities",

       title = "Normal Distributions of (Premolt Sizes v Postmolt Sizes) v (In the Field v In the Lab)")

Next I looked to create separate histograms for each of these lists:

hist(pre_field_crabs, 

       main="Histogram of Premolt Sizes in the Field",

       xlab="Carapace Sizes",

       border="black",

       col="chartreuse",

       xlim=c(110,160),

       las=1, 

       breaks=100)

 

hist(pre_lab_crabs, 

       main="Histogram of Premolt Sizes in the Lab",

       xlab="Carapace Sizes",

       border="black",

       col="darkgoldenrod1",

       xlim=c(30,160),

       las=1, 

       breaks=100)

 

hist(post_field_crabs, 

     main="Histogram of Postmolt Sizes in the Field",

     xlab="Carapace Sizes",

     border="black",

     col="darkgreen",

     xlim=c(120,170),

     las=1, 

     breaks=100)

 

hist(post_lab_crabs, 

     main="Histogram of Postmolt Sizes in the Lab",

     xlab="Carapace Sizes",

     border="black",

     col="red",

     xlim=c(30,170),

     las=1, 

     breaks=100)

 

Now a procedure is developed for predicting a crab’s premolt size from its postmolt size by deriving an explicit expression.  First, set y and x equal to the means of the full premolt and postmolt lists, respectively.

y <- mean_pre_crab_list

x <- mean_post_crab_list

Then the correlation coefficient (r) can be calculated, which gives the slope (b_hat) of the regression line; from b_hat the intercept term (a_hat) follows, and finally the residual standard deviation (SD_r).

r <- (1/length(post_crab_list)) * sum(((post_crab_list - x) / SD_post_crab_list) * ((pre_crab_list - y) / SD_pre_crab_list))

b_hat <- r * (SD_pre_crab_list / SD_post_crab_list)

a_hat <- b_hat * x - y

SD_r <- sqrt(1 - r^2) * sd(pre_crab_list)

The regression line prediction is thus:   yi = b_hat * xi - a_hat (with a_hat defined as above).
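As a cross-check (again my own addition, not the original analysis), base R's lm() fits the same least-squares line.  The postmolt column is named postsz, as used in the filters below; the premolt column name presz is my assumption and may need to be adjusted to match the actual header in crabs.data.

fit <- lm(presz ~ postsz, data = crabsizes)  # 'presz' is an assumed column name

coef(fit)  # slope should roughly match b_hat; intercept should match y - b_hat * x, i.e. -a_hat as defined above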

To test whether this equation is accurate, it was applied to three different groups of postmolt sizes: the first group between 147.5 mm and 152.5 mm, the second between 142.5 mm and 147.5 mm, and the third between 152.5 mm and 157.5 mm.  The first block of code filters the rows into three tables according to the ranges above.  The second and third blocks of code pull out the columns containing the postmolt and premolt sizes from those three tables, respectively.

group1_crabsizes <- filter(crabsizes, postsz >= 147.5, postsz <= 152.5)

group2_crabsizes <- filter(crabsizes, postsz >= 142.5, postsz <= 147.5)

group3_crabsizes <- filter(crabsizes, postsz >= 152.5, postsz <= 157.5)

 

group1_postmolt <- pull(group1_crabsizes, var = -4)

group2_postmolt <- pull(group2_crabsizes, var = -4)

group3_postmolt <- pull(group3_crabsizes, var = -4)

 

group1_premolt <- pull(group1_crabsizes, var = -5)

group2_premolt <- pull(group2_crabsizes, var = -5)

group3_premolt <- pull(group3_crabsizes, var = -5)

Now to write out the calculations for what each expected premolt size should be (y_hat#), and compare that to what it actually is (group#_premolt).

y_hat1 <- b_hat * group1_postmolt - a_hat

y_hat2 <- b_hat * group2_postmolt - a_hat

y_hat3 <- b_hat * group3_postmolt - a_hat

Compare now by finding each test statistic value:

test_statistic1 <- sum((y_hat1 - mean(group1_premolt))^2) / mean(group1_premolt)

test_statistic2 <- sum((y_hat2 - mean(group2_premolt))^2) / mean(group2_premolt)

test_statistic3 <- sum((y_hat3 - mean(group3_premolt))^2) / mean(group3_premolt)

 

Results:  Here is a table sharing the values of the relevant descriptive statistics described above.  It has been color-coded from green to red (green, green-yellow, yellow, orange, and red): a green number is the largest in its row (descriptive statistic), a red number is the smallest, and the other colors fall in between.

Table:

Descriptive Statistic | Premolt Sizes | Postmolt Sizes | Premolt Sizes (Field) | Postmolt Sizes (Field) | Premolt Sizes (Lab) | Postmolt Sizes (Lab)
Mean | 129.21186 | 143.89767 | 139.00901 | 152.96396 | 126.19945 | 141.10997
Median | 132.8 | 154 | 140.1 | 154 | 128.9 | 143.7
Standard Deviation | 15.86452 | 14.64060 | 7.25115 | 6.71997 | 16.56878 | 15.28078
IQR | 18.325 | 15.45 | 7.70000 | 7 | 18.1 | 15.6
Minimum | 31.1 | 38.8 | 113.6 | 127.7 | 31.1 | 38.8
Maximum | 155.1 | 166.8 | 153.9 | 166.5 | 155.1 | 166.8
Z-3 | 81.6183 | 99.97587 | 117.25556 | 132.80405 | 76.49311 | 95.26763
Z-2 | 97.48282 | 114.61647 | 124.50671 | 139.52402 | 93.06189 | 110.54841
Z-1 | 113.34734 | 129.25707 | 131.75786 | 146.24399 | 109.63067 | 125.82919
Z0 | 129.21186 | 143.89767 | 139.00901 | 152.96396 | 126.19945 | 141.10997
Z1 | 145.07638 | 158.53827 | 146.26016 | 159.68393 | 142.76823 | 156.39075
Z2 | 160.9409 | 173.17887 | 153.51131 | 166.4039 | 159.33701 | 171.67153
Z3 | 176.80542 | 187.81927 | 160.76246 | 173.12387 | 175.90579 | 186.95231
Skewness | -1.99712 | -2.33945 | -1.09590 | -1.10398 | -1.88109 | -2.27861
Kurtosis | 9.72498 | 13.06052 | 4.67604 | 5.14670 | 8.97447 | 12.37337

 

A comparative look at the Normal Distributions of all premolt carapace sizes against all postmolt carapace sizes helps to paint a clear visual picture.  It can be seen that postmolt carapaces tend to be larger than premolt carapaces, which is to be expected.

It is also helpful to compare the histograms of these two distributions.

The histograms tell the same story.  The study now turns to the visual results of whether carapaces grew relatively larger in the field or in the lab: first the Normal Distributions, then each of the four histograms.

It can be clearly seen that carapace sizes measured in the field have a much higher probability of being larger (both premolt and postmolt) than those of crabs observed to molt in the lab.

The expression for predicting a crab’s premolt size from its postmolt size that was derived in the methods section is now brought forth here: 

yi = b_hat * xi - a_hat

From the calculations made in the program, b_hat = 1.07 and a_hat = 24.89.  Therefore, my regression line prediction expression is yi = 1.07 * xi - 24.89.

For any postmolt size (xi) that is given, multiply it by 1.07 and then subtract 24.89, and you will receive a fairly accurate prediction of that crab’s premolt size (yi).
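For example, a crab with a postmolt carapace of 150 mm would be predicted to have had a premolt carapace of about 1.07 * 150 - 24.89 ≈ 135.6 mm.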

The residual SD (SD_r) was calculated to be about 2.42 mm.  This means that, when using the regression prediction equation above, about 68% of observed premolt sizes will fall within about 2.42 mm above or below their predictions, and about 95% will fall within about 4.85 mm.  Finally, the values of the test statistics calculated between the expected and observed premolt sizes are:

test_statistic1 = 2.0489

test_statistic2 = 1.437982

test_statistic3 = 1.608636

Discussion and Conclusion:  The values of the test statistics are relatively low, suggesting that the observed and expected values are close and that the fit is appropriate.  Therefore, this study accomplished what it set out to do: it summarized the data numerically and graphically and provided an expression that predicts, reasonably accurately, a crab’s premolt carapace size from its postmolt carapace size alone.

Statistical Report 2


 

Payton McCarthy

Professor Davis

MTH 332

13 February, 2020

 

Analysis of Who Plays Video Games

 

Abstract:  This study examines data from a random survey of students at the University of California, Berkeley.  These students were enrolled in Statistics 2 during the Fall of 1994.  The survey looks at the students’ reported frequency of video game play, whether they like to play video games and why, and a collection of general information about the students.

 

Intro and Background:   An exam was given the week before the survey was taken.  314 students took the exam; 95 of them were chosen at random for the survey, and 91 responded.  The objective of this study is to explore the survey responses with the intent of providing useful information to others.

 

Methods:  This study uses the computer program RStudio to make statistical calculations about the dataset.  After downloading the data ‘video.data’ from the course website, use this code

data <- read.table("video.data", header = TRUE, sep = "")

to label the dataset as ‘data’.  Then I quickly labeled all the constants given to us,

N <- 314

n <- 91

nonrespondents <- 4
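(A small aside I will add: the 91 respondents plus the 4 nonrespondents account for the 95 students originally selected for the survey.)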

Then I separated the data based on the first column, "time", which is simply the number of hours played in the week prior to the survey.  The data were split into two lists: the time-values of the students who did play during that week, and the time-values of all students.  This was done so that statistical calculations can be made about each of these two groups separately.  Do this by using

stu_DidPlay <- filter(data, time != 0)

stu_DidPlay_values <- pull(stu_DidPlay, var = -15)

 

values_time <- pull(data, var = -15)

The fraction of students who did play can be calculated by

fraction_DidPlay <- length(stu_DidPlay_values) / n

Now all the descriptive statistics can be computed directly:

x_DidPlay <- mean(stu_DidPlay_values)

x_time <- mean(values_time)

 

Median_DidPlay  <- median(stu_DidPlay_values)

Median_time <- median(values_time)

 

SD_DidPlay <- sd(stu_DidPlay_values)

SD_time <- sd(values_time)

 

IQR_DidPlay  <- IQR(stu_DidPlay_values)

IQR_time <- IQR(values_time)

 

min_DidPlay  <- min(stu_DidPlay_values)

min_time <- min(values_time)

max_DidPlay  <- max(stu_DidPlay_values)

max_time <- max(values_time)

Next is to find the z-scores, skewness, and kurtosis for both groups.  This is done by calculating the z-score for each value in both lists of students who did play and all students:

z_DidPlay <- (stu_DidPlay_values - mean(stu_DidPlay_values))/sd(stu_DidPlay_values)

z_time <- (values_time - mean(values_time))/sd(values_time)

These new lists are comprised of all the z-scores for all the time-values for both groups.  Now the skewness and kurtosis can be calculated by:

skewness_DidPlay <- sum(z_DidPlay ^ 3) / (length(z_DidPlay) * (sd(stu_DidPlay_values)) ^ 3)

skewness_time <- sum(z_time ^ 3) / (length(z_time) * (sd(values_time)) ^ 3)

kurtosis_DidPlay <- sum(z_DidPlay ^ 4) / (length(z_DidPlay) * (sd(stu_DidPlay_values)) ^ 4)

kurtosis_time <- sum(z_time ^ 4) / (length(z_time) * (sd(values_time)) ^ 4)

We now want to calculate a 95% confidence interval for this data.  I did so for both datasets:

lower_2SDinterval_DidPlay <- (x_DidPlay - (2 * SD_DidPlay)) / sqrt(n)

upper_2SDinterval_DidPlay <- (x_DidPlay + (2 * SD_DidPlay)) / sqrt(n)

 

lower_2SDinterval_time <- (x_time - (2 * SD_time)) / sqrt(n)

upper_2SDinterval_time <- (x_time + (2 * SD_time)) / sqrt(n)

 

The next thing performed was to display all the important statistical information in a visual way that would further aid in understanding the results.  The first way this is done is by plotting the Normal Curves of both datasets so they can be compared easily, using this code:

ggplot(data = data.frame(stu_DidPlay_values = c(0, 30)),

       mapping = aes(x = stu_DidPlay_values)) +

    stat_function(mapping = aes(colour = "Students Who Did Play"),

                  fun = dnorm,

                  args = list(mean = mean(stu_DidPlay_values),

                              sd = sd(stu_DidPlay_values))) +

    stat_function(mapping = aes(colour = "All Students"),

                  fun = dnorm,

                  args = list(mean = mean(values_time),

                              sd = sd(values_time))) +

    scale_colour_manual(values = c("blue", "red")) +

    labs(x = "Hours Played Last Week",

         y = "Probabilities",

         title = "Normal Curves for Hours Played Last Week Between Groups")

 

Next I looked at the second column in ‘video.data’, labeled "like".  This data tells us how much the student likes to play: 1 = never played; 2 = very much; 3 = somewhat; 4 = not really; 5 = not at all.  It was important to filter out the rows where a 99 was recorded, as these would skew the data:

stu_like <- filter(data, like != 99)

stu_like_values <- pull(stu_like, var = -14)

The descriptive stats for this dataset were found by,

x_stu_like_values <- mean(stu_like_values)

Median_stu_like_values <- median(stu_like_values)

SD_stu_like_values <- sd(stu_like_values)

IQR_stu_like_values <- IQR(stu_like_values)

min_stu_like_values <- min(stu_like_values)

max_stu_like_values <- max(stu_like_values)

 

z_stu_like_values <- (stu_like_values - mean(stu_like_values))/sd(stu_like_values)

 

skewness_stu_like_values <- sum(z_stu_like_values ^ 3) / (length(z_stu_like_values) * (sd(stu_like_values)) ^ 3)

kurtosis_stu_like_values <- sum(z_stu_like_values ^ 4) / (length(z_stu_like_values) * (sd(stu_like_values)) ^ 4)

 

Then a second figure was made: another Normal Distribution graph that shows the distribution of how much students like to play:

ggplot(data = data.frame(stu_like_values = c(1, 5)),

       mapping = aes(x = stu_like_values)) +

    stat_function(mapping = aes(colour = "Who Likes to Play"),

                  fun = dnorm,

                  args = list(mean = mean(stu_like_values),

                              sd = sd(stu_like_values))) +

    labs(x = "How Much Students Like to Play",

         y = "Probabilities",

         title = "Normal Curve for How Much Students Like to Play")

This kind of graph is beneficial for displaying the mean and standard deviation simply.  The skewness and kurtosis can be clearly seen and are comparable.

 

The next thing done was to look at the frequency of play, that is, how often these students normally play video games: 1 = daily; 2 = weekly; 3 = monthly; 4 = semesterly.

freq_play <- filter(data, freq != 99)

freq_play_values <- pull(freq_play, var = -12)

Descriptive stats,

x_freq_play_values <- mean(freq_play_values)

Median_freq_play_values <- median(freq_play_values)

SD_freq_play_values <- sd(freq_play_values)

IQR_freq_play_values <- IQR(freq_play_values)

min_freq_play_values <- min(freq_play_values)

max_freq_play_values <- max(freq_play_values)

 

z_freq_play_values <- (freq_play_values - mean(freq_play_values)) / sd(freq_play_values)

 

skewness_freq_play_values <- sum(z_freq_play_values ^ 3) / (length(z_freq_play_values) * (sd(freq_play_values)) ^ 3)

kurtosis_freq_play_values <- sum(z_freq_play_values ^ 4) / (length(z_freq_play_values) * (sd(freq_play_values)) ^ 4)

 

Finally, a Normal Distribution was made for the students’ reported frequency of play,

ggplot(data = data.frame(freq_play_values = c(1, 4)),

       mapping = aes(x = freq_play_values)) +

    stat_function(mapping = aes(colour = "How Often Will Play"),

                  fun = dnorm,

                  args = list(mean = mean(freq_play_values),

                              sd = sd(freq_play_values))) +

    labs(x = "How Often Students Will Play",

         y = "Probabilities",

         title = "Normal Curve for Frequency of Play")

 

Results:  Here is a table sharing all of the values of the relevant descriptive statistics described above.

Table:

Descriptive Statistic | Hours Played by Students Who Did Play | Hours Played by All Students | If Students Like to Play (1-5) | Reported Frequency of Play (1-4)
Mean | 3.32647 | 1.24286 | 3.02222 | 2.70513
Median | 2 | 0 | 3 | 3
Standard Deviation | 5.63616 | 3.77704 | 0.87381 | 1.02068
IQR | 1 | 1.25 | 1 | 2
Minimum | 0.1 | 0 | 1 | 1
Maximum | 30 | 30 | 5 | 4
Z-3 | | | 0.40079 |
Z-2 | | | 1.2746 | 0.66377
Z-1 | | | 2.14841 | 1.68445
Z0 | 3.32647 | 1.24286 | 3.02222 | 2.70513
Z1 | 8.96263 | 5.0199 | 3.89603 | 3.72581
Z2 | 14.59879 | 8.79694 | 4.76984 |
Z3 | 20.23495 | 12.57398 | |
Skewness | 0.01942 | 0.10566 | 0.83539 | 0.11051
Kurtosis | 0.01539 | 0.195606 | 5.25736 | 1.75250

 

The fraction of students who played video games the week before the survey is about 37.4%.
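(In other words, roughly 34 of the 91 respondents reported a nonzero number of hours, since 34/91 ≈ 0.374.)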

The estimated 95% confidence interval for the amount of hours spent by students who did play the week before is (-0.833, 1.530), and for the amount of hours spent by all students in general is (-0.662, 0.922).  A look at the Normal Distributions for these confidence intervals helps to paint a clearer picture:

It is impossible for students to have a negative amount of hours spent playing video games, which is visually shown by the graph above.  Therefore, the confidence intervals are better stated to be (0, 1.530) and (0, 0.922), respectively.

We can compare this with the reported frequency of play and see that students lean towards 3 more than 2; that is, they play closer to monthly than weekly.  According to the table, students have an average frequency-of-play value of about 2.7, which corresponds to a sample average of 1.25 hours of gaming in the week before the survey across all students, and an average of 3.3 hours across the students who actually played.

If we look at the final Normal Distribution, for how much students actually like to play, it looks far more standard.  This would seem to indicate that the students’ frequency of play may have been affected by the exam given during the week before the survey.

 

Discussion and Conclusion:  In this final part, we can look at some of the “attitude polls” from the students.  By counting, we can see that 23/90 of the reporting students ‘very much’ like to play video games, 46/90 ‘somewhat’ like to play, and 21/90 are ‘not really’ or ‘not at all’ into playing video games.

The reasons for this vary.  Some 72 participants gave reasons why they do like to play video games, with most reporting that they play to relax (66/72 = 91.7%).  Some also reported secondary or even tertiary reasons: 38.9% like the ‘feeling of mastery’, 37.5% play because they are ‘bored’, 36.1% play for the ‘graphics/realism’ that video games offer, and 33.3% like the ‘mental challenge’ that video games can present.

Some 83 participants also reported why they do not like to play video games.  The two most common reasons were that it takes up ‘too much time’ (57.8%) and that it ‘costs too much’ (48.2%), followed by feelings that it is ‘pointless’ (39.8%) and ‘frustrating’ (31.3%).

Statistical Report 1


 

Payton McCarthy

Professor Davis

MTH 332

7 February, 2020

Does Smoking Affect a Child’s Birth Weight?

 

Abstract:  This study compares data on the birth weights of babies from mothers who smoke and from mothers who do not.  The data are from the Child Health and Development Studies (CHDS), which includes all pregnancies from 1960-1967 among women in the Kaiser Foundation Health Plan in Oakland, California.  There is a claim that smoking is responsible for a 5.3 to 8.8 ounce reduction in birth weight, and that smoking mothers are about twice as likely as nonsmoking mothers to have a low birth-weight baby (under 88.2 ounces).

 

Intro and Background:  Many health care practitioners and health organizations have given strong warnings to pregnant mothers who smoke.  However, not everyone has taken these warnings seriously; for example, it was found in 1996 that 15% of pregnant women smoked through their pregnancy (National Center for Health Statistics).  This study looks at whether there is a difference between the birth weights of babies born to smoking mothers and nonsmoking mothers, and if so, whether the difference is significant.

 

Methods:  When this data was recorded, the measurements of each of the 1236 babies were taken at birth.  The weights were collected and measured in ounces. The smoking status of the mother was either tallied as a 1 or a 0 to denote if they were smoking or nonsmoking respectively.  This study uses the computer program RStudio to make statistical calculations about the dataset.  After downloading the data ‘babiesI.data’ from the course website, use this code

data <- read.table("babiesI.data", header = TRUE, sep = "")

to label the dataset as ‘data’.  Then you want to be able to separate the data into two different datasets of smokers and non smokers so that you can make statistical calculations about each of these two separate groups.  Do this by using 

smokers <- filter(data, smoke == 1)

smokers <- pull(smokers, var = -2)

to create a list of birth weights of babies born to smoking mothers referred to as ‘smokers’.  Do the same for the birth weights of babies born to non smoking mothers and refer to this as ‘nonsmokers’.

nonsmokers <- filter(data, smoke == 0)

nonsmokers <- pull(nonsmokers, var = -2)

Now all the descriptive statistics can be computed directly:

mean(smokers)

median(smokers)

sd(smokers)

IQR(smokers)

min(smokers)

max(smokers)

then,

mean(nonsmokers)

median(nonsmokers)

sd(nonsmokers)

IQR(nonsmokers)

min(nonsmokers)

max(nonsmokers)

Next is to find the z-scores, skewness, and kurtosis for both groups.  This is done by calculating the z-score for each value in both lists of smokers and non smokers:

z_smokers <- (smokers - mean(smokers))/sd(smokers)

z_nonsmokers <- (nonsmokers - mean(nonsmokers))/sd(nonsmokers)

These new lists are comprised of all the z-scores for all the corresponding birth weights of smokers and non smokers.  Now the skewness and kurtosis can be calculated by:

skewness_smokers <-sum(z_smokers ^ 3) / (length(z_smokers) * (sd(smokers)) ^ 3)

kurtosis_smokers <- sum(z_smokers ^ 4) / (length(z_smokers) * (sd(smokers)) ^ 4)

skewness_nonsmokers <- sum(z_nonsmokers ^ 3) / (length(z_nonsmokers) * (sd(nonsmokers)) ^ 3)

kurtosis_nonsmokers <- sum(z_nonsmokers ^ 4) / (length(z_nonsmokers) * (sd(nonsmokers)) ^ 4)

We now want to calculate the probability that a baby is born underweight (88.2 ounces or below), as mentioned in the abstract, for babies born to mothers who smoked and to mothers who did not.  The first step is to compute the z-score of the 88.2-ounce cutoff in each group.

underweight_smokers <- (88.2 - mean(smokers))/sd(smokers)

underweight_nonsmokers <- (88.2 - mean(nonsmokers))/sd(nonsmokers)
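These are z-scores for the 88.2-ounce cutoff; the underweight probabilities reported in the Results table correspond to the standard normal CDF of these values, which (a small addition on my part) can be obtained with pnorm():

prob_underweight_smokers <- pnorm(underweight_smokers)        # about 0.076, i.e. 7.6%

prob_underweight_nonsmokers <- pnorm(underweight_nonsmokers)  # about 0.023, i.e. 2.3%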

The next thing performed was to display all the important statistical information in a visual way that would further aid in understanding the results.  The first way this is done is with a Box and Whisker plot that compares the ‘smokers’ and ‘nonsmokers’ datasets, using this code:

boxPlot_SvN <- boxplot(nonsmokers, smokers,

                       main = "Child Birthweights for Nonsmoking Mothers and Smoking Mothers",

                       at = c(1, 2), names = c("Non-S", "Smokers"),

                       xlab = "Birthweight in Ounces", horizontal = TRUE, notch = TRUE)

This plot will show the locations, dispersions, and outliers of the birth weights of babies to smoking mothers and non smoking mothers fairly well.  It may also give a sense for the skewness and the tail size. The second figure that is to be shared is a Normal Distribution graph that also compares the same two datasets:

ggplot(data = data.frame(smokers = c(30, 200)),

         mapping = aes(x = smokers)) +

     stat_function(mapping = aes(colour = "Smoking Mothers"),

                  fun = dnorm,

                  args = list(mean = mean(smokers),

                              sd = sd(smokers))) +

     stat_function(mapping = aes(colour = "Nonsmoking Mothers"),

                  fun = dnorm,

                  args = list(mean = mean(nonsmokers),

                              sd = sd(nonsmokers))) +

     scale_colour_manual(values = c("blue", "red")) +

     labs(x = "in Ounces",

         y = "Probabilities",

         title = "Normal Curves for Child Birthweights")

This graph will be beneficial for displaying the mean and standard deviation of both datasets.  The skewness and kurtosis of both datasets will be clearly seen and comparable.

 

Results:  Here is a table sharing all of the values of the relevant descriptive statistics described above.

Table:

Descriptive Statistic | Smoking Mother | Nonsmoking Mother
Mean | 114.1095 | 123.0472
Median | 115 | 123
Standard Deviation | 18.09895 | 17.39869
IQR | 24 | 21
Minimum | 58 | 55
Maximum | 163 | 176
Z-3 | 59.81265 | 70.85113
Z-2 | 77.9116 | 88.24982
Z-1 | 96.01055 | 105.64851
Z0 | 114.1095 | 123.0472
Z1 | 132.20845 | 140.44589
Z2 | 150.3074 | 157.84458
Z3 | 168.40635 | 175.24327
Skewness | -5.6489e-6 | -3.543044e-5
Kurtosis | 2.7732e-5 | 4.393673e-5
Probability of Underweight (under 88.2 ounces) | 7.6136% | 2.2596%

Box and Whisker Plot:

Normal Distribution:

Discussion and Conclusion:  The question this study looked at is whether there is a difference between the birth weights of babies born to smoking mothers and nonsmoking mothers, and if so, whether the difference is significant.  Two claims are being tested: is smoking responsible for a 5.3 to 8.8 ounce reduction in birth weight, and are mothers who smoke about twice as likely as nonsmoking mothers to have a low birth-weight baby (under 88.2 ounces)?

If we look at the data, there is a mean difference of 8.9377 ounces between babies whose mothers did not smoke through their pregnancy and those whose mothers did.  For the given data, the first claim appears to hold, as the observed mean difference sits at (indeed slightly above) the upper end of the claimed 5.3 to 8.8 ounce range.

As to whether mothers who smoke are about twice as likely as nonsmoking mothers to have a low birth-weight baby (under 88.2 ounces), we look at the calculated statistics.  The table shows that babies born to smoking mothers have a 7.6136% chance of being born underweight, compared with 2.2596% for babies born to nonsmoking mothers.  According to the given data, babies of smoking mothers are therefore about 3.37 times as likely to be born underweight (7.6136 / 2.2596 ≈ 3.37), a ratio roughly 70% larger than the factor of two in the original claim.

One factor that would seemingly be important to this study, yet is unavailable, is the age of each pregnant woman.  Certainly a mother’s age has some effect on the child’s health and birth weight, and it would be interesting to see how this extra variable, conditioned on whether the mother smokes, would affect the data and the results.  Nevertheless, with the given data we have affirmed that smoking appears to be responsible for a reduction in birth weight of at least 5.3 to 8.8 ounces, and that smoking mothers are about twice as likely (or more) as nonsmoking mothers to have a baby born underweight (88.2 ounces or less).