Statistical Report 3

Payton McCarthy

Professor Davis

MTH 332

21 February, 2020

 

Analysis of Carapace Sizes Before and After the Molting of Female Dungeness Crabs

 

Abstract:  This study compares the carapace (shell) sizes of female Dungeness crabs before and after molting.  The data were collected after 1983 on the Pacific coast of North America by the California Department of Fish and Game and various commercial crab fishers.  This study aims to examine the relationship between premolt and postmolt carapace sizes and summarize the results both numerically and graphically.

 

Intro and Background:   The Dungeness Crab is one of the largest and most abundant crabs on the Pacific coast.  When it grows, the crab molts periodically, casting off its shell and growing a new one.  It is reportedly easy to tell if a crab has molted recently by how clean its shell is. All measurements are in millimeters.

 

Methods:  This study uses the computer program RStudio to make statistical calculations about the dataset.  After downloading the data ‘crabs.data’ from the course website, load the packages used later (dplyr for pull and filter, ggplot2 for ggplot) and use this code

library(dplyr)

library(ggplot2)

crabsizes <- read.table("crabs.data", header=TRUE, sep="")

to label the dataset as ‘crabsizes’.  

Then I separated the data into two different lists, one of premolt sizes and the other of postmolt sizes.  A negative value of var counts columns from the right.

pre_crab_list <- pull(crabsizes, var = -5)

post_crab_list <- pull(crabsizes, var = -4)

I then calculated all of the descriptive statistics for both lists of values above by the following,

mean_pre_crab_list <- mean(pre_crab_list)

median_pre_crab_list <- median(pre_crab_list)

SD_pre_crab_list <- sd(pre_crab_list)

IQR_pre_crab_list <- IQR(pre_crab_list)

min_pre_crab_list <- min(pre_crab_list)

max_pre_crab_list <- max(pre_crab_list)

 

mean_post_crab_list <- mean(post_crab_list)

median_post_crab_list <- median(post_crab_list)

SD_post_crab_list <- sd(post_crab_list)

IQR_post_crab_list <- IQR(post_crab_list)

min_post_crab_list <- min(post_crab_list)

max_post_crab_list <- max(post_crab_list)

 

z_pre_crab_list <- (pre_crab_list - mean(pre_crab_list))/sd(pre_crab_list)

z_post_crab_list <- (post_crab_list - mean(post_crab_list))/sd(post_crab_list)

 

skewness_pre_crab_list <- sum( ((pre_crab_list - mean(pre_crab_list))^3) / (length(pre_crab_list)*(sd(pre_crab_list) ^ 3)) )

skewness_post_crab_list <- sum( ((post_crab_list - mean(post_crab_list))^3) / (length(post_crab_list)*(sd(post_crab_list) ^ 3)) )

 

kurtosis_pre_crab_list <- sum( ((pre_crab_list - mean(pre_crab_list))^4) / (length(pre_crab_list)*(sd(pre_crab_list) ^ 4)) )

kurtosis_post_crab_list <- sum( ((post_crab_list - mean(post_crab_list))^4) / (length(post_crab_list)*(sd(post_crab_list) ^ 4)) )
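The moment calculations above repeat the same pattern for every list, so they could be wrapped in small helper functions. The following is only a minimal sketch and not the code used for this report: the names moment_skewness and moment_kurtosis and the sample vector are hypothetical, and the helpers keep the same convention as the lines above (sd() combined with a 1/n average).

```r
# Hypothetical helpers mirroring the formulas above: the 1/n average of the
# deviations raised to the 3rd or 4th power, scaled by sd() to the same power.
moment_skewness <- function(v) {
  sum((v - mean(v))^3) / (length(v) * sd(v)^3)
}

moment_kurtosis <- function(v) {
  sum((v - mean(v))^4) / (length(v) * sd(v)^4)
}

# Made-up sizes (not from crabs.data) with one unusually small value.
example_sizes <- c(120, 125, 130, 132, 135, 138, 140, 60)

moment_skewness(example_sizes)  # negative: the long tail is on the left
moment_kurtosis(example_sizes)
```

With helpers like these, each of the skewness and kurtosis lines reduces to a single call such as moment_skewness(pre_crab_list).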

I then created a Normal Distribution graph for these two lists with,

ggplot(data = data.frame(pre_crab_list = c(30, 200)),
       mapping = aes(x = pre_crab_list)) +
  stat_function(mapping = aes(colour = "Premolted Carapace Sizes"),
                fun = dnorm,
                args = list(mean = mean(pre_crab_list),
                            sd = sd(pre_crab_list))) +
  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes"),
                fun = dnorm,
                args = list(mean = mean(post_crab_list),
                            sd = sd(post_crab_list))) +
  scale_colour_manual(values = c("blue", "red")) +
  labs(x = "Carapace Sizes",
       y = "Probabilities",
       title = "Normal Distributions of Premolt Sizes v Postmolt Sizes")

followed by two histogram plots for both lists as well.

hist(pre_crab_list,
     main="Histogram of Premolt Sizes",
     xlab="Carapace Sizes",
     border="black",
     col="darkorchid1",
     xlim=c(30,160),
     las=1,
     breaks=100)

 

hist(post_crab_list,
     main="Histogram of Postmolt Sizes",
     xlab="Carapace Sizes",
     border="black",
     col="darkslateblue",
     xlim=c(30,180),
     las=1,
     breaks=100)

 

These were the entire lists of premolt and postmolt carapace sizes.  The study now focuses on whether the crabs molted in the field or in the lab, and on any statistical differences between the two.  To do this, the original data table was split into two, one of crabs molting in the field and the other of crabs molting in the lab.

 

field_crabs <- filter(crabsizes, lf == 0)

lab_crabs <- filter(crabsizes, lf == 1)

These two tables were each stripped down to two lists: premolt sizes in the field, postmolt sizes in the field, premolt sizes in the lab, and postmolt sizes in the lab.

pre_field_crabs <- pull(field_crabs, var = -5)

post_field_crabs <- pull(field_crabs, var = -4)

pre_lab_crabs <- pull(lab_crabs, var = -5)

post_lab_crabs <- pull(lab_crabs, var = -4)

Now all the descriptive statistics for each of these four lists can be found fairly simply by:

mean_pre_field_crabs <- mean(pre_field_crabs)

median_pre_field_crabs <- median(pre_field_crabs)

SD_pre_field_crabs <- sd(pre_field_crabs)

IQR_pre_field_crabs <- IQR(pre_field_crabs)

min_pre_field_crabs <- min(pre_field_crabs)

max_pre_field_crabs <- max(pre_field_crabs)

 

mean_post_field_crabs <- mean(post_field_crabs)

median_post_field_crabs <- median(post_field_crabs)

SD_post_field_crabs <- sd(post_field_crabs)

IQR_post_field_crabs <- IQR(post_field_crabs)

min_post_field_crabs <- min(post_field_crabs)

max_post_field_crabs <- max(post_field_crabs)

 

mean_pre_lab_crabs <- mean(pre_lab_crabs)

median_pre_lab_crabs <- median(pre_lab_crabs)

SD_pre_lab_crabs <- sd(pre_lab_crabs)

IQR_pre_lab_crabs <- IQR(pre_lab_crabs)

min_pre_lab_crabs <- min(pre_lab_crabs)

max_pre_lab_crabs <- max(pre_lab_crabs)

 

mean_post_lab_crabs <- mean(post_lab_crabs)

median_post_lab_crabs <- median(post_lab_crabs)

SD_post_lab_crabs <- sd(post_lab_crabs)

IQR_post_lab_crabs <- IQR(post_lab_crabs)

min_post_lab_crabs <- min(post_lab_crabs)

max_post_lab_crabs <- max(post_lab_crabs)

 

z_pre_field_crabs <- (pre_field_crabs - mean(pre_field_crabs))/sd(pre_field_crabs)

z_post_field_crabs <- (post_field_crabs - mean(post_field_crabs))/sd(post_field_crabs)

z_pre_lab_crabs <- (pre_lab_crabs - mean(pre_lab_crabs))/sd(pre_lab_crabs)

z_post_lab_crabs <- (post_lab_crabs - mean(post_lab_crabs))/sd(post_lab_crabs)

 

skewness_pre_field_crabs <- sum( ((pre_field_crabs - mean(pre_field_crabs))^3) / (length(pre_field_crabs)*(sd(pre_field_crabs) ^ 3)) )

skewness_post_field_crabs <- sum( ((post_field_crabs - mean(post_field_crabs))^3) / (length(post_field_crabs)*(sd(post_field_crabs) ^ 3)) )

skewness_pre_lab_crabs <- sum( ((pre_lab_crabs - mean(pre_lab_crabs))^3) / (length(pre_lab_crabs)*(sd(pre_lab_crabs) ^ 3)) )

skewness_post_lab_crabs <- sum( ((post_lab_crabs - mean(post_lab_crabs))^3) / (length(post_lab_crabs)*(sd(post_lab_crabs) ^ 3)) )

 

kurtosis_pre_field_crabs <- sum( ((pre_field_crabs - mean(pre_field_crabs))^4) / (length(pre_field_crabs)*(sd(pre_field_crabs) ^ 4)) )

kurtosis_post_field_crabs <- sum( ((post_field_crabs - mean(post_field_crabs))^4) / (length(post_field_crabs)*(sd(post_field_crabs) ^ 4)) )

kurtosis_pre_lab_crabs <- sum( ((pre_lab_crabs - mean(pre_lab_crabs))^4) / (length(pre_lab_crabs)*(sd(pre_lab_crabs) ^ 4)) )

kurtosis_post_lab_crabs <- sum( ((post_lab_crabs - mean(post_lab_crabs))^4) / (length(post_lab_crabs)*(sd(post_lab_crabs) ^ 4)) )
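Because the same six statistics are computed for every list, one possible way to shorten this step is to apply a single function to a named list of groups. This is a sketch under stated assumptions rather than the code used here: the data frame toy_crabs holds made-up rows, describe is a hypothetical helper, and presz is an assumed name for the premolt column (postsz and lf are the names used in the filters above).

```r
# Made-up rows standing in for crabs.data; `presz` is an assumed column name.
toy_crabs <- data.frame(
  presz  = c(130, 135, 141, 128, 120, 133),
  postsz = c(145, 149, 155, 142, 134, 147),
  lf     = c(0, 0, 0, 1, 1, 1)   # 0 = field, 1 = lab, as in the filters above
)

# Hypothetical helper: the six basic statistics for one numeric vector.
describe <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v),
    iqr = IQR(v), min = min(v), max = max(v))
}

# One named entry per group, so each result is stored under its own name.
groups <- list(
  pre_field  = toy_crabs$presz[toy_crabs$lf == 0],
  post_field = toy_crabs$postsz[toy_crabs$lf == 0],
  pre_lab    = toy_crabs$presz[toy_crabs$lf == 1],
  post_lab   = toy_crabs$postsz[toy_crabs$lf == 1]
)

# sapply() returns a matrix: one row per statistic, one column per group.
summary_table <- sapply(groups, describe)
summary_table
```

A single value can then be read off by name, for example summary_table["mean", "pre_field"].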

The next thing performed was to display all this important statistical information in a visual way that would aid in understanding the results.  The first way this is done is by looking at the Normal Curves of all four lists to compare them easily:

ggplot(data = data.frame(pre_field_crabs = c(75, 200)),
       mapping = aes(x = pre_field_crabs)) +
  stat_function(mapping = aes(colour = "Premolted Carapace Sizes in the Field"),
                fun = dnorm,
                args = list(mean = mean(pre_field_crabs),
                            sd = sd(pre_field_crabs))) +
  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes in the Field"),
                fun = dnorm,
                args = list(mean = mean(post_field_crabs),
                            sd = sd(post_field_crabs))) +
  stat_function(mapping = aes(colour = "Premolted Carapace Sizes in the Lab"),
                fun = dnorm,
                args = list(mean = mean(pre_lab_crabs),
                            sd = sd(pre_lab_crabs))) +
  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes in the Lab"),
                fun = dnorm,
                args = list(mean = mean(post_lab_crabs),
                            sd = sd(post_lab_crabs))) +
  scale_colour_manual(values = c("darkgreen", "red", "chartreuse3", "darkgoldenrod1")) +
  labs(x = "Carapace Sizes",
       y = "Probabilities",
       title = "Normal Distributions of (Premolt Sizes v Postmolt Sizes) v (In the Field v In the Lab)")

Next I looked to create separate histograms for each of these lists:

hist(pre_field_crabs,
     main="Histogram of Premolt Sizes in the Field",
     xlab="Carapace Sizes",
     border="black",
     col="chartreuse",
     xlim=c(110,160),
     las=1,
     breaks=100)

 

hist(pre_lab_crabs,
     main="Histogram of Premolt Sizes in the Lab",
     xlab="Carapace Sizes",
     border="black",
     col="darkgoldenrod1",
     xlim=c(30,160),
     las=1,
     breaks=100)

 

hist(post_field_crabs,
     main="Histogram of Postmolt Sizes in the Field",
     xlab="Carapace Sizes",
     border="black",
     col="darkgreen",
     xlim=c(120,170),
     las=1,
     breaks=100)

 

hist(post_lab_crabs,
     main="Histogram of Postmolt Sizes in the Lab",
     xlab="Carapace Sizes",
     border="black",
     col="red",
     xlim=c(30,170),
     las=1,
     breaks=100)

 

Next, a procedure is developed for predicting a crab’s premolt size from its postmolt size by deriving a regression line.  First, let y and x be the means of the full premolt and postmolt lists, respectively.

y <- mean_pre_crab_list

x <- mean_post_crab_list

Then the correlation coefficient (r) is calculated, which is used to find the slope (b_hat) of the regression line.  After b_hat comes a_hat, and finally the residual standard deviation (SD_r).  Note that a_hat is defined here as b_hat * x - y, which is the negative of the usual intercept, so it is subtracted when making predictions.

r <- (1/length(post_crab_list)) * sum(((post_crab_list - x) / SD_post_crab_list) * ((pre_crab_list - y) / SD_pre_crab_list))

b_hat <- r * (SD_pre_crab_list / SD_post_crab_list)

a_hat <- b_hat * x - y

SD_r <- sqrt(1 - r^2) * sd(pre_crab_list)

The regression line prediction is thus:   yi = b_hat * xi - a_hat.
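As a cross-check on the hand calculation, the same quantities can be obtained from R's built-in cor() and lm(). The sketch below uses made-up paired sizes rather than crabs.data, and all variable names are illustrative. It writes the line in the standard form y = intercept + slope * x, so the a_hat computed in this report corresponds to the negative of intercept.

```r
# Made-up paired sizes (not from crabs.data): premolt is the response,
# postmolt is the predictor.
toy_post <- c(134, 142, 145, 147, 149, 155)
toy_pre  <- c(120, 128, 130, 133, 135, 141)

# Slope and intercept in the standard form y = intercept + slope * x.
r_builtin <- cor(toy_post, toy_pre)
slope     <- r_builtin * sd(toy_pre) / sd(toy_post)
intercept <- mean(toy_pre) - slope * mean(toy_post)

# lm() fits the same least-squares line, so the coefficients should agree.
fit <- lm(toy_pre ~ toy_post)
all.equal(unname(coef(fit)), c(intercept, slope))

# The hand formula in this report averages with 1/n while sd() uses n - 1,
# so its r equals cor() multiplied by (n - 1) / n.
n <- length(toy_post)
r_by_hand <- (1 / n) * sum(((toy_post - mean(toy_post)) / sd(toy_post)) *
                           ((toy_pre - mean(toy_pre)) / sd(toy_pre)))
all.equal(r_by_hand, r_builtin * (n - 1) / n)
```

For a dataset with several hundred rows the factor (n - 1) / n is very close to 1, so the two versions of r differ only slightly.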

To test whether this equation is accurate, it was applied to three groups of postmolt sizes: the first between 147.5mm and 152.5mm, the second between 142.5mm and 147.5mm, and the third between 152.5mm and 157.5mm.  The first block of code filters the rows into three tables according to these ranges.  The second and third blocks of code pull the postmolt and premolt columns, respectively, out of those three tables.

group1_crabsizes <- filter(crabsizes, postsz >= 147.5, postsz <= 152.5)

group2_crabsizes <- filter(crabsizes, postsz >= 142.5, postsz <= 147.5)

group3_crabsizes <- filter(crabsizes, postsz >= 152.5, postsz <= 157.5)

 

group1_postmolt <- pull(group1_crabsizes, var = -4)

group2_postmolt <- pull(group2_crabsizes, var = -4)

group3_postmolt <- pull(group3_crabsizes, var = -4)

 

group1_premolt <- pull(group1_crabsizes, var = -5)

group2_premolt <- pull(group2_crabsizes, var = -5)

group3_premolt <- pull(group3_crabsizes, var = -5)

Now to write out the calculations for what each expected premolt size should be (y_hat#), and compare that to what it actually is (group#_premolt).

y_hat1 <- b_hat * group1_postmolt - a_hat

y_hat2 <- b_hat * group2_postmolt - a_hat

y_hat3 <- b_hat * group3_postmolt - a_hat

Compare now by finding each test statistic value:

test_statistic1 <- sum((y_hat1 - mean(group1_premolt))^2) / mean(group1_premolt)

test_statistic2 <- sum((y_hat2 - mean(group2_premolt))^2) / mean(group2_premolt)

test_statistic3 <- sum((y_hat3 - mean(group3_premolt))^2) / mean(group3_premolt)
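Another way to look at the fit inside one postmolt band, sketched here as an alternative and not the statistic used in this report, is to compare each prediction with its own observed premolt size through residuals. The band values below are made up, and b_hat_example and a_hat_example are assumed constants rounded from the values reported in the Results.

```r
# Made-up band of postmolt sizes with their observed premolt sizes (mm).
band_post <- c(148.0, 149.5, 150.2, 151.0, 152.3)
band_pre  <- c(133.5, 135.0, 136.1, 136.4, 138.2)

# Assumed slope and offset, rounded from the Results; the offset is
# subtracted, matching y_hat <- b_hat * postmolt - a_hat above.
b_hat_example <- 1.07
a_hat_example <- 24.89

band_pred      <- b_hat_example * band_post - a_hat_example
residuals_band <- band_pre - band_pred

mean(residuals_band)          # average signed error in mm
sqrt(mean(residuals_band^2))  # root-mean-square error in mm
```

A mean residual near zero and a root-mean-square error comparable to SD_r would both point to a reasonable fit within that band.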

 

Results:  Here is a table of the descriptive statistics described above.  It has been color coordinated from green to red (green, green-yellow, yellow, orange, and red).  If a number is green then it is the largest in its row (descriptive statistic), if it is red then it is the smallest in its row, and the other colors fall in between.

Table:

| Descriptive Statistic | Premolt Sizes | Postmolt Sizes | Premolt Sizes – Field | Postmolt Sizes – Field | Premolt Sizes – Lab | Postmolt Sizes – Lab |
|---|---|---|---|---|---|---|
| Mean | 129.21186 | 143.89767 | 139.00901 | 152.96396 | 126.19945 | 141.10997 |
| Median | 132.8 | 154 | 140.1 | 154 | 128.9 | 143.7 |
| Standard Deviation | 15.86452 | 14.64060 | 7.25115 | 6.71997 | 16.56878 | 15.28078 |
| IQR | 18.325 | 15.45 | 7.70000 | 7 | 18.1 | 15.6 |
| Minimum | 31.1 | 38.8 | 113.6 | 127.7 | 31.1 | 38.8 |
| Maximum | 155.1 | 166.8 | 153.9 | 166.5 | 155.1 | 166.8 |
| Z-3 | 81.6183 | 99.97587 | 117.25556 | 132.80405 | 76.49311 | 95.26763 |
| Z-2 | 97.48282 | 114.61647 | 124.50671 | 139.52402 | 93.06189 | 110.54841 |
| Z-1 | 113.34734 | 129.25707 | 131.75786 | 146.24399 | 109.63067 | 125.82919 |
| Z0 | 129.21186 | 143.89767 | 139.00901 | 152.96396 | 126.19945 | 141.10997 |
| Z1 | 145.07638 | 158.53827 | 146.26016 | 159.68393 | 142.76823 | 156.39075 |
| Z2 | 160.9409 | 173.17887 | 153.51131 | 166.4039 | 159.33701 | 171.67153 |
| Z3 | 176.80542 | 187.81947 | 160.76246 | 173.12387 | 175.90579 | 186.95231 |
| Skewness | -1.99712 | -2.33945 | -1.09590 | -1.10398 | -1.88109 | -2.27861 |
| Kurtosis | 9.72498 | 13.06052 | 4.67604 | 5.14670 | 8.97447 | 12.37337 |
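The Z rows in the table are the sizes lying k standard deviations from the mean, that is mean + k * SD for k from -3 to 3. The short sketch below reproduces the overall premolt column from the mean and SD printed in the table; the variable names are illustrative only.

```r
# Mean and SD of the overall premolt sizes, copied from the table (mm).
premolt_mean <- 129.21186
premolt_sd   <- 15.86452

# Sizes lying k standard deviations from the mean, for k = -3, ..., 3.
k <- -3:3
z_sizes <- premolt_mean + k * premolt_sd
names(z_sizes) <- paste0("Z", k)

round(z_sizes, 5)
```

The rounded values agree with the Premolt Sizes column, from 81.6183 at Z-3 up to 176.80542 at Z3.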

 

A comparative look at the Normal Distributions for all premolt carapace sizes against all postmolt carapace sizes helps to paint a clear, visual picture.  The postmolt curve sits to the right of the premolt curve, so postmolt carapaces tend to be larger, as expected, since a crab molts as it grows.

It is also useful to compare the histograms of both of these distributions.

The histograms tell the same story.  Next are the visual results for whether the shells were larger for crabs that molted in the field or in the lab: first the Normal Distributions and then each of the four histograms.

It can be clearly noted that crabs that molted in the field tend to have larger carapaces (both premolt and postmolt) than crabs that molted in the lab.

The expression for predicting a crab’s premolt size from its postmolt size that was derived in the methods section is now brought forth here:

yi = b_hat * xi - a_hat

From the calculations made in the program, b_hat = 1.07 and a_hat = 24.89.  Because a_hat was computed as b_hat * x - y, it is subtracted, and my regression line prediction expression is yi = 1.07 * xi - 24.89.

For any postmolt size (xi) that is given, multiply it by 1.07 and then subtract 24.89 to get a fairly accurate prediction of that crab’s premolt size (yi).  As a check, the mean postmolt size of 143.90mm gives 1.07 * 143.90 - 24.89 = 129.08mm, close to the mean premolt size of 129.21mm.
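The prediction rule can be written as a small function. This is a minimal sketch using the rounded values b_hat = 1.07 and a_hat = 24.89 reported above; the function name predict_premolt is hypothetical, and the offset is subtracted because the Methods code defines a_hat as b_hat * x - y and computes predictions as b_hat * postmolt - a_hat.

```r
# Rounded slope and offset from the Results (assumed constants for this
# sketch); the offset is subtracted, as in the Methods code.
b_hat_rounded <- 1.07
a_hat_rounded <- 24.89

# Hypothetical helper: predicted premolt size (mm) for given postmolt sizes.
predict_premolt <- function(postmolt_mm) {
  b_hat_rounded * postmolt_mm - a_hat_rounded
}

predict_premolt(c(145, 150, 155))
```

For postmolt sizes of 145mm, 150mm, and 155mm this gives predicted premolt sizes of about 130.26mm, 135.61mm, and 140.96mm.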

The residual SD (SD_r) was calculated to be about 2.42mm.  This means that when using the regression prediction equation above, and assuming the residuals are roughly Normal, about 68% of yi-values will fall within about 2.42mm of their predictions, and about 95% will fall within about 4.85mm.  Finally, the values of the test statistics calculated between the expected and the observed premolt sizes are:

test_statistic1 = 2.0489

test_statistic2 = 1.437982

test_statistic3 = 1.608636

Discussion and Conclusion:  The values of the test statistics are relatively low, which suggests that the observed and expected values are close and that the fit is appropriate.  Therefore, this study completed what it set out to do: it provided numerical and graphical summaries of the data, along with an expression that predicts, with reasonable accuracy, a crab’s premolt carapace size from its postmolt carapace size alone.
