Statistical Report 3
Payton McCarthy
Professor Davis
MTH 332
21 February, 2020
Analysis of Carapace Sizes Before and After the Molting of Female Dungeness Crabs
Abstract: This study compares the carapace (shell) sizes of female Dungeness crabs before and after molting. The data was taken after the year 1983 on the Pacific coast of North America. It was collected by the California Department of Fish and Game and various commercial crab fishers. This study aims to examine the relationship between premolt and postmolt carapace sizes and summarize the results both numerically and graphically.
Intro and Background: The Dungeness Crab is one of the largest and most abundant crabs on the Pacific coast. When it grows, the crab molts periodically, casting off its shell and growing a new one. It is reportedly easy to tell if a crab has molted recently by how clean its shell is. All measurements are in millimeters.
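Methods: This study uses the computer program RStudio to make statistical calculations about the dataset. After downloading the data ‘crabs.data’ from the course website, I loaded the dplyr and ggplot2 packages (needed for the filter(), pull(), and ggplot() calls below) and used this code
library(dplyr); library(ggplot2)
crabsizes <- read.table("crabs.data", header = TRUE, sep = "")
to read the dataset into a table named ‘crabsizes’.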
Then I separated the data into two different lists, one of premolt sizes and the other of postmolt sizes.
pre_crab_list <- pull(crabsizes, var = -5)
post_crab_list <- pull(crabsizes, var = -4)
I then calculated all of the descriptive statistics for both lists of values above by the following,
mean_pre_crab_list <- mean(pre_crab_list)
median_pre_crab_list <- median(pre_crab_list)
SD_pre_crab_list <- sd(pre_crab_list)
IQR_pre_crab_list <- IQR(pre_crab_list)
min_pre_crab_list <- min(pre_crab_list)
max_pre_crab_list <- max(pre_crab_list)
mean_post_crab_list <- mean(post_crab_list)
median_post_crab_list <- median(post_crab_list)
SD_post_crab_list <- sd(post_crab_list)
IQR_post_crab_list <- IQR(post_crab_list)
min_post_crab_list <- min(post_crab_list)
max_post_crab_list <- max(post_crab_list)
z_pre_crab_list <- (pre_crab_list - mean(pre_crab_list))/sd(pre_crab_list)
z_post_crab_list <- (post_crab_list - mean(post_crab_list))/sd(post_crab_list)
skewness_pre_crab_list <- sum( ((pre_crab_list - mean(pre_crab_list))^3) / (length(pre_crab_list)*(sd(pre_crab_list) ^ 3)) )
skewness_post_crab_list <- sum( ((post_crab_list - mean(post_crab_list))^3) / (length(post_crab_list)*(sd(post_crab_list) ^ 3)) )
kurtosis_pre_crab_list <- sum( ((pre_crab_list - mean(pre_crab_list))^4) / (length(pre_crab_list)*(sd(pre_crab_list) ^ 4)) )
kurtosis_post_crab_list <- sum( ((post_crab_list - mean(post_crab_list))^4) / (length(post_crab_list)*(sd(post_crab_list) ^ 4)) )
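The repeated calculations above could also be wrapped in a single helper. The following is only a minimal sketch, not the code used for this report: the function name describe_sizes and the sample vector are hypothetical, and the skewness and kurtosis lines use the same formula as above (the sum of cubed or fourth-power deviations divided by n times the SD raised to that power).

```r
# Hypothetical helper: returns the descriptive statistics used in this report.
describe_sizes <- function(sizes) {
  n <- length(sizes)
  m <- mean(sizes)
  s <- sd(sizes)  # sample SD (divides by n - 1)
  c(mean     = m,
    median   = median(sizes),
    sd       = s,
    iqr      = IQR(sizes),
    min      = min(sizes),
    max      = max(sizes),
    skewness = sum((sizes - m)^3) / (n * s^3),
    kurtosis = sum((sizes - m)^4) / (n * s^4))
}

# Made-up sizes in millimeters, only to show the call.
example_sizes <- c(120.5, 131.2, 128.9, 140.1, 135.4)
describe_sizes(example_sizes)
```

Calling describe_sizes(pre_crab_list) and describe_sizes(post_crab_list) would then return all eight values for each list in one named vector.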
I then created a Normal Distribution graph for these two lists with,
ggplot(data = data.frame(pre_crab_list = c(30, 200)),
       mapping = aes(x = pre_crab_list)) +
  stat_function(mapping = aes(colour = "Premolted Carapace Sizes"),
                fun = dnorm,
                args = list(mean = mean(pre_crab_list),
                            sd = sd(pre_crab_list))) +
  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes"),
                fun = dnorm,
                args = list(mean = mean(post_crab_list),
                            sd = sd(post_crab_list))) +
  scale_colour_manual(values = c("blue", "red")) +
  labs(x = "Carapace Sizes",
       y = "Probabilities",
       title = "Normal Distributions of Premolt Sizes v Postmolt Sizes")
followed by two histogram plots for both lists as well.
hist(pre_crab_list,
     main="Histogram of Premolt Sizes",
     xlab="Carapace Sizes",
     border="black",
     col="darkorchid1",
     xlim=c(30,160),
     las=1,
     breaks=100)
hist(post_crab_list,
     main="Histogram of Postmolt Sizes",
     xlab="Carapace Sizes",
     border="black",
     col="darkslateblue",
     xlim=c(30,180),
     las=1,
     breaks=100)
The statistics and plots above cover the entire lists of premolt and postmolt carapace sizes. The study next considers whether each crab molted in the field or in the lab and looks at any statistical differences between the two groups. To do this, I split the original data table into two, one of crabs molting in the field and the other of crabs molting in the lab.
field_crabs <- filter(crabsizes, lf == 0)
lab_crabs <- filter(crabsizes, lf == 1)
Each of these two tables was then reduced to two lists, giving four in total: premolt sizes in the field, postmolt sizes in the field, premolt sizes in the lab, and postmolt sizes in the lab.
pre_field_crabs <- pull(field_crabs, var = -5)
post_field_crabs <- pull(field_crabs, var = -4)
pre_lab_crabs <- pull(lab_crabs, var = -5)
post_lab_crabs <- pull(lab_crabs, var = -4)
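As an alternative to filtering twice, the same four lists could be obtained by splitting on the lf column with base R. This is a sketch on a small made-up data frame, not the real data. The column names postsz and lf appear in the code in this report, while presz is an assumed name for the premolt column.

```r
# Made-up rows standing in for crabs.data; presz is an assumed column name.
toy_crabs <- data.frame(
  presz  = c(113.6, 118.1, 142.3, 125.1, 98.2),
  postsz = c(127.7, 133.2, 154.8, 142.3, 112.0),
  lf     = c(0, 0, 0, 1, 1)  # 0 = molted in the field, 1 = molted in the lab
)

# split() returns one vector per value of lf, named "0" and "1".
pre_by_place  <- split(toy_crabs$presz,  toy_crabs$lf)
post_by_place <- split(toy_crabs$postsz, toy_crabs$lf)

# Mean premolt and postmolt size for each group.
sapply(pre_by_place, mean)
sapply(post_by_place, mean)
```

Here pre_by_place[["0"]] plays the role of pre_field_crabs and pre_by_place[["1"]] the role of pre_lab_crabs.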
Now all the descriptive statistics for each of these four lists can be found fairly simply by:
mean_pre_field_crabs <- mean(pre_field_crabs)
median_pre_field_crabs <- median(pre_field_crabs)
SD_pre_field_crabs <- sd(pre_field_crabs)
IQR_pre_field_crabs <- IQR(pre_field_crabs)
min_pre_field_crabs <- min(pre_field_crabs)
max_pre_field_crabs <- max(pre_field_crabs)
mean_post_field_crabs <- mean(post_field_crabs)
median_post_field_crabs <- median(post_field_crabs)
SD_post_field_crabs <- sd(post_field_crabs)
IQR_post_field_crabs <- IQR(post_field_crabs)
min_post_field_crabs <- min(post_field_crabs)
max_post_field_crabs <- max(post_field_crabs)
mean_pre_lab_crabs <- mean(pre_lab_crabs)
median_pre_lab_crabs <- median(pre_lab_crabs)
SD_pre_lab_crabs <- sd(pre_lab_crabs)
IQR_pre_lab_crabs <- IQR(pre_lab_crabs)
min_pre_lab_crabs <- min(pre_lab_crabs)
max_pre_lab_crabs <- max(pre_lab_crabs)
mean_post_lab_crabs <- mean(post_lab_crabs)
median_post_lab_crabs <- median(post_lab_crabs)
SD_post_lab_crabs <- sd(post_lab_crabs)
IQR_post_lab_crabs <- IQR(post_lab_crabs)
min_post_lab_crabs <- min(post_lab_crabs)
max_post_lab_crabs <- max(post_lab_crabs)
z_pre_field_crabs <- (pre_field_crabs - mean(pre_field_crabs))/sd(pre_field_crabs)
z_post_field_crabs <- (post_field_crabs - mean(post_field_crabs))/sd(post_field_crabs)
z_pre_lab_crabs <- (pre_lab_crabs - mean(pre_lab_crabs))/sd(pre_lab_crabs)
z_post_lab_crabs <- (post_lab_crabs - mean(post_lab_crabs))/sd(post_lab_crabs)
skewness_pre_field_crabs <- sum( ((pre_field_crabs - mean(pre_field_crabs))^3) / (length(pre_field_crabs)*(sd(pre_field_crabs) ^ 3)) )
skewness_post_field_crabs <- sum( ((post_field_crabs - mean(post_field_crabs))^3) / (length(post_field_crabs)*(sd(post_field_crabs) ^ 3)) )
skewness_pre_lab_crabs <- sum( ((pre_lab_crabs - mean(pre_lab_crabs))^3) / (length(pre_lab_crabs)*(sd(pre_lab_crabs) ^ 3)) )
skewness_post_lab_crabs <- sum( ((post_lab_crabs - mean(post_lab_crabs))^3) / (length(post_lab_crabs)*(sd(post_lab_crabs) ^ 3)) )
kurtosis_pre_field_crabs <- sum( ((pre_field_crabs - mean(pre_field_crabs))^4) / (length(pre_field_crabs)*(sd(pre_field_crabs) ^ 4)) )
kurtosis_post_field_crabs <- sum( ((post_field_crabs - mean(post_field_crabs))^4) / (length(post_field_crabs)*(sd(post_field_crabs) ^ 4)) )
kurtosis_pre_lab_crabs <- sum( ((pre_lab_crabs - mean(pre_lab_crabs))^4) / (length(pre_lab_crabs)*(sd(pre_lab_crabs) ^ 4)) )
kurtosis_post_lab_crabs <- sum( ((post_lab_crabs - mean(post_lab_crabs))^4) / (length(post_lab_crabs)*(sd(post_lab_crabs) ^ 4)) )
The next step was to display this statistical information visually to aid in understanding the results. The first way this is done is by plotting the Normal Curves of all four lists together so they can be compared easily:
ggplot(data = data.frame(pre_field_crabs = c(75, 200)),
       mapping = aes(x = pre_field_crabs)) +
  stat_function(mapping = aes(colour = "Premolted Carapace Sizes in the Field"),
                fun = dnorm,
                args = list(mean = mean(pre_field_crabs),
                            sd = sd(pre_field_crabs))) +
  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes in the Field"),
                fun = dnorm,
                args = list(mean = mean(post_field_crabs),
                            sd = sd(post_field_crabs))) +
  stat_function(mapping = aes(colour = "Premolted Carapace Sizes in the Lab"),
                fun = dnorm,
                args = list(mean = mean(pre_lab_crabs),
                            sd = sd(pre_lab_crabs))) +
  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes in the Lab"),
                fun = dnorm,
                args = list(mean = mean(post_lab_crabs),
                            sd = sd(post_lab_crabs))) +
  scale_colour_manual(values = c("darkgreen", "red", "chartreuse3", "darkgoldenrod1")) +
  labs(x = "Carapace Sizes",
       y = "Probabilities",
       title = "Normal Distributions of (Premolt Sizes v Postmolt Sizes) v (In the Field v In the Lab)")
Next I looked to create separate histograms for each of these lists:
hist(pre_field_crabs,
     main="Histogram of Premolt Sizes in the Field",
     xlab="Carapace Sizes",
     border="black",
     col="chartreuse",
     xlim=c(110,160),
     las=1,
     breaks=100)
hist(pre_lab_crabs,
     main="Histogram of Premolt Sizes in the Lab",
     xlab="Carapace Sizes",
     border="black",
     col="darkgoldenrod1",
     xlim=c(30,160),
     las=1,
     breaks=100)
hist(post_field_crabs,
     main="Histogram of Postmolt Sizes in the Field",
     xlab="Carapace Sizes",
     border="black",
     col="darkgreen",
     xlim=c(120,170),
     las=1,
     breaks=100)
hist(post_lab_crabs,
     main="Histogram of Postmolt Sizes in the Lab",
     xlab="Carapace Sizes",
     border="black",
     col="red",
     xlim=c(30,170),
     las=1,
     breaks=100)
Next, a procedure is developed for predicting a crab’s premolt size from its postmolt size. First, set y and x equal to the means of the total premolt and postmolt lists, respectively.
y <- mean_pre_crab_list
x <- mean_post_crab_list
Then the correlation coefficient (r) can be calculated, which is used to find the slope (b_hat) of the regression line. After b_hat comes a_hat, and finally the residual standard deviation (SD_r).
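# Note: sd() divides by n - 1, so dividing this sum by n gives a value smaller than cor(post_crab_list, pre_crab_list) by a factor of (n - 1)/n.
r <- (1/length(post_crab_list)) * sum(((post_crab_list - x) / SD_post_crab_list) * ((pre_crab_list - y) / SD_pre_crab_list))
b_hat <- r * (SD_pre_crab_list / SD_post_crab_list)
# a_hat is stored with the opposite sign of the usual intercept, so the predictions subtract it.
a_hat <- b_hat * x - y
SD_r <- sqrt(1 - r^2) * sd(pre_crab_list)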
The regression line prediction is thus: yi = b_hat * xi - a_hat, where a_hat is subtracted because it was computed as b_hat * x - y.
To test whether this equation is accurate, I applied it to three groups of postmolt sizes: the first between 147.5mm and 152.5mm, the second between 142.5mm and 147.5mm, and the third between 152.5mm and 157.5mm. The first block of code filters the rows into three tables according to these ranges. The second and third blocks of code pull the postmolt and premolt columns, respectively, out of those three tables.
group1_crabsizes <- filter(crabsizes, postsz >= 147.5, postsz <= 152.5)
group2_crabsizes <- filter(crabsizes, postsz >= 142.5, postsz <= 147.5)
group3_crabsizes <- filter(crabsizes, postsz >= 152.5, postsz <= 157.5)
group1_postmolt <- pull(group1_crabsizes, var = -4)
group2_postmolt <- pull(group2_crabsizes, var = -4)
group3_postmolt <- pull(group3_crabsizes, var = -4)
group1_premolt <- pull(group1_crabsizes, var = -5)
group2_premolt <- pull(group2_crabsizes, var = -5)
group3_premolt <- pull(group3_crabsizes, var = -5)
Next, calculate the expected premolt sizes for each group (y_hat#) so they can be compared with the observed premolt sizes (group#_premolt).
y_hat1 <- b_hat * group1_postmolt - a_hat
y_hat2 <- b_hat * group2_postmolt - a_hat
y_hat3 <- b_hat * group3_postmolt - a_hat
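One way to check a hand-derived line is to compare it with R’s built-in lm(). The sketch below uses simulated data (the numbers are made up, not the crab measurements) and hypothetical variable names ending in _sim. It computes r with cor(), which matches the formula above when the sum is divided by n - 1, and it keeps the sign convention of the code above, where a_hat = b_hat * x - y is subtracted in the prediction.

```r
set.seed(1)
# Simulated postmolt (x) and premolt (y) sizes; not the real data.
post_sim <- rnorm(200, mean = 144, sd = 15)
pre_sim  <- 1.07 * post_sim - 25 + rnorm(200, sd = 2.4)

r_sim     <- cor(post_sim, pre_sim)
b_hat_sim <- r_sim * sd(pre_sim) / sd(post_sim)
a_hat_sim <- b_hat_sim * mean(post_sim) - mean(pre_sim)  # same convention as a_hat

# The prediction subtracts a_hat_sim, as in the y_hat lines.
y_hat_sim <- b_hat_sim * post_sim - a_hat_sim

# lm() reports the intercept with the usual sign, so it should equal -a_hat_sim.
fit <- lm(pre_sim ~ post_sim)
all.equal(unname(coef(fit)[2]), b_hat_sim)
all.equal(unname(coef(fit)[1]), -a_hat_sim)
all.equal(unname(fitted(fit)), y_hat_sim)
```

If all three comparisons return TRUE, the hand calculation and lm() describe the same line.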
Compare now by finding each test statistic value:
test_statistic1 <- sum((y_hat1 - mean(group1_premolt))^2) / mean(group1_premolt)
test_statistic2 <- sum((y_hat2 - mean(group2_premolt))^2) / mean(group2_premolt)
test_statistic3 <- sum((y_hat3 - mean(group3_premolt))^2) / mean(group3_premolt)
Results: The table below lists the values of the descriptive statistics described above. It has been color coded from green to red (green, green-yellow, yellow, orange, and red). Within each row (descriptive statistic), the largest number is green, the smallest is red, and the rest fall in between.
Table:
Descriptive Statistic | Premolt Sizes | Postmolt Sizes | Premolt Sizes – Field | Postmolt Sizes – Field | Premolt Sizes – Lab | Postmolt Sizes – Lab
Mean | 129.21186 | 143.89767 | 139.00901 | 152.96396 | 126.19945 | 141.10997
Median | 132.8 | 154 | 140.1 | 154 | 128.9 | 143.7
Standard Deviation | 15.86452 | 14.64060 | 7.25115 | 6.71997 | 16.56878 | 15.28078
IQR | 18.325 | 15.45 | 7.70000 | 7 | 18.1 | 15.6
Minimum | 31.1 | 38.8 | 113.6 | 127.7 | 31.1 | 38.8
Maximum | 155.1 | 166.8 | 153.9 | 166.5 | 155.1 | 166.8
Z-3 | 81.6183 | 99.97587 | 117.25556 | 132.80405 | 76.49311 | 95.26763
Z-2 | 97.48282 | 114.61647 | 124.50671 | 139.52402 | 93.06189 | 110.54841
Z-1 | 113.34734 | 129.25707 | 131.75786 | 146.24399 | 109.63067 | 125.82919
Z0 | 129.21186 | 143.89767 | 139.00901 | 152.96396 | 126.19945 | 141.10997
Z1 | 145.07638 | 158.53827 | 146.26016 | 159.68393 | 142.76823 | 156.39075
Z2 | 160.9409 | 173.17887 | 153.51131 | 166.4039 | 159.33701 | 171.67153
Z3 | 176.80542 | 187.81927 | 160.76246 | 173.12387 | 175.90579 | 186.95231
Skewness | -1.99712 | -2.33945 | -1.09590 | -1.10398 | -1.88109 | -2.27861
Kurtosis | 9.72498 | 13.06052 | 4.67604 | 5.14670 | 8.97447 | 12.37337
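The Z rows in the table are consistent with mean + z * SD for z from -3 to 3 (for example, Z1 for premolt sizes is 129.21186 + 15.86452 = 145.07638). A minimal sketch of that calculation follows, using the premolt mean and SD reported in the table. The variable names are hypothetical and this is not necessarily how the rows were originally produced.

```r
# Mean and SD of all premolt sizes, copied from the table above.
mean_pre <- 129.21186
sd_pre   <- 15.86452

# Size (in mm) that sits z standard deviations from the mean, for z = -3, ..., 3.
z <- -3:3
z_rows <- mean_pre + z * sd_pre
names(z_rows) <- paste0("Z", z)
round(z_rows, 5)
```

The same three lines, with the mean and SD of another column swapped in, reproduce the Z rows for that column.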
A comparison of the Normal Distributions for all premolt carapace sizes against all postmolt carapace sizes gives a clear visual picture. The postmolt curve is shifted toward larger sizes than the premolt curve, which is expected because the crab grows a new, larger shell when it molts.
It is also useful to compare the histograms of both of these distributions.
The histograms tell the same story. Next, this study looks at the visual results for crabs that molted in the field versus in the lab: first the Normal Distributions and then each of the four histograms.
It can be clearly noted that crabs that molted in the field have a much higher probability of larger carapace sizes, both premolt and postmolt, than crabs that were observed to molt in the lab.
The expression for predicting a crab’s premolt size from its postmolt size, following the code in the Methods section (where a_hat was computed as b_hat * x - y and is subtracted in each y_hat calculation), is:
yi = b_hat * xi - a_hat
From the calculations made in the program, b_hat = 1.07 and a_hat = 24.89. Therefore, my regression line prediction expression is yi = 1.07 * xi - 24.89.
For any postmolt size (xi) that is given, multiply it by 1.07 and then subtract 24.89 to obtain a fairly accurate prediction of that crab’s premolt size (yi).
The residual SD (SD_r) was calculated to be about 2.42mm. This means that, when using the regression prediction equation above, about 68% of yi-values will fall within about 2.42mm of their predictions, and about 95% will fall within about 4.85mm. Finally, the test statistic values calculated between the expected and the observed premolt sizes are:
test_statistic1 = 2.0489
test_statistic2 = 1.437982
test_statistic3 = 1.608636
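To illustrate how the prediction and the residual SD are used together, here is a small sketch for one hypothetical postmolt size of 150mm. It uses the rounded values reported above (1.07, 24.89, and 2.42) and follows the Methods code, where a_hat is subtracted in the prediction. The variable names are illustrative only.

```r
b_hat_rounded <- 1.07   # slope reported above
a_hat_rounded <- 24.89  # a_hat reported above (subtracted, as in the Methods code)
sd_r_rounded  <- 2.42   # residual SD in mm

postmolt_mm <- 150  # hypothetical postmolt size
predicted_premolt <- b_hat_rounded * postmolt_mm - a_hat_rounded

# Approximate 68% and 95% bands: one and two residual SDs around the prediction.
band_68 <- predicted_premolt + c(-1, 1) * sd_r_rounded
band_95 <- predicted_premolt + c(-2, 2) * sd_r_rounded
```

For this input the predicted premolt size is 135.61mm, with an approximate 68% band of 133.19mm to 138.03mm and an approximate 95% band of 130.77mm to 140.45mm.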
Discussion and Conclusion: The test statistic values are relatively low, which suggests that the expected values are close to the observed values and that the fit is appropriate. This study therefore completed what it set out to do: it provided numerical and graphical summaries of the data, along with an expression that predicts, to a reasonable degree of accuracy, a crab’s premolt carapace size from its postmolt carapace size alone.