Statistical Report 3

Payton McCarthy

Professor Davis

MTH 332

21 February, 2020

Analysis of Carapace Sizes Before and After the Molting of Female Dungeness Crabs

Abstract: This study compares the carapace (shell) sizes of female Dungeness crabs before and after molting. The data were collected after 1983 on the Pacific coast of North America by the California Department of Fish and Game and various commercial crab fishers. The study examines the relationship between premolt and postmolt carapace sizes and summarizes the results both numerically and graphically.

Intro and Background: The Dungeness Crab is one of the largest and most abundant crabs on the Pacific coast. When it grows, the crab molts periodically, casting off its shell and growing a new one. It is reportedly easy to tell if a crab has molted recently by how clean its shell is. All measurements are in millimeters.

Methods: This study uses RStudio to carry out the statistical calculations. After downloading the data file ‘crabs.data’ from the course website, I loaded the packages used below and read the file with

library(dplyr)   # pull(), filter()

library(ggplot2) # ggplot(), stat_function()

crabsizes <- read.table("crabs.data", header = TRUE, sep = "")

which stores the dataset as ‘crabsizes’.

Then I separated the data into two lists, one of premolt sizes and the other of postmolt sizes.

pre_crab_list <- pull(crabsizes, var = -5)

post_crab_list <- pull(crabsizes, var = -4)
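In pull(), a negative var counts columns from the right, so var = -5 selects the fifth column from the right and var = -4 the fourth. The sketch below shows the same indexing in base R. The toy data frame, its placeholder column names (c1 to c5), and the helper pull_from_right are all hypothetical and assume a five-column table purely for illustration.

```r
# Hypothetical five-column table; names and values are placeholders, not crab data.
toy <- data.frame(c1 = c(100, 110), c2 = c(114, 125),
                  c3 = c(14, 15), c4 = c(1, 1), c5 = c(0, 0))

# Base-R equivalent of pull(df, var = -k): the k-th column counted from the right.
pull_from_right <- function(df, k) {
  df[[ncol(df) - k + 1]]
}

pre  <- pull_from_right(toy, 5)  # same column as pull(toy, var = -5)
post <- pull_from_right(toy, 4)  # same column as pull(toy, var = -4)
```

Selecting columns by name would be more robust to a change in column order, if the column names are known.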

I then calculated the descriptive statistics for both lists as follows:

mean_pre_crab_list <- mean(pre_crab_list)

median_pre_crab_list <- median(pre_crab_list)

SD_pre_crab_list <- sd(pre_crab_list)

IQR_pre_crab_list <- IQR(pre_crab_list)

min_pre_crab_list <- min(pre_crab_list)

max_pre_crab_list <- max(pre_crab_list)

mean_post_crab_list <- mean(post_crab_list)

median_post_crab_list <- median(post_crab_list)

SD_post_crab_list <- sd(post_crab_list)

IQR_post_crab_list <- IQR(post_crab_list)

min_post_crab_list <- min(post_crab_list)

max_post_crab_list <- max(post_crab_list)

z_pre_crab_list <- (pre_crab_list - mean(pre_crab_list))/sd(pre_crab_list)

z_post_crab_list <- (post_crab_list - mean(post_crab_list))/sd(post_crab_list)

skewness_pre_crab_list <- sum( ((pre_crab_list - mean(pre_crab_list))^3) / (length(pre_crab_list)*(sd(pre_crab_list) ^ 3)) )

skewness_post_crab_list <- sum( ((post_crab_list - mean(post_crab_list))^3) / (length(post_crab_list)*(sd(post_crab_list) ^ 3)) )

kurtosis_pre_crab_list <- sum( ((pre_crab_list - mean(pre_crab_list))^4) / (length(pre_crab_list)*(sd(pre_crab_list) ^ 4)) )

kurtosis_post_crab_list <- sum( ((post_crab_list - mean(post_crab_list))^4) / (length(post_crab_list)*(sd(post_crab_list) ^ 4)) )
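Since the same statistics are computed for every list, the calculations above could be collected into one function. The following is a minimal sketch, not the code used for this report: the helper name describe and the toy vector are hypothetical, and the skewness and kurtosis formulas are copied from the lines above.

```r
# Hypothetical helper: all descriptive statistics for one numeric vector.
describe <- function(x) {
  n <- length(x)
  m <- mean(x)
  s <- sd(x)  # sample SD (n - 1 denominator), as in the code above
  c(mean = m, median = median(x), sd = s, iqr = IQR(x),
    min = min(x), max = max(x),
    skewness = sum((x - m)^3) / (n * s^3),
    kurtosis = sum((x - m)^4) / (n * s^4))
}

toy_sizes <- c(100, 110, 120, 130, 140)  # made-up values, not crab data
stats <- describe(toy_sizes)
```

Calling describe(pre_crab_list) and describe(post_crab_list) would then return each set of statistics as a named vector.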

I then created a Normal Distribution graph for these two lists with,

ggplot(data = data.frame(pre_crab_list = c(30, 200)),
       mapping = aes(x = pre_crab_list)) +
  stat_function(mapping = aes(colour = "Premolted Carapace Sizes"),
                fun = dnorm,
                args = list(mean = mean(pre_crab_list),
                            sd = sd(pre_crab_list))) +
  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes"),
                fun = dnorm,
                args = list(mean = mean(post_crab_list),
                            sd = sd(post_crab_list))) +
  scale_colour_manual(values = c("blue", "red")) +
  labs(x = "Carapace Sizes",
       y = "Probabilities",
       title = "Normal Distributions of Premolt Sizes v Postmolt Sizes")

followed by a histogram for each list.

hist(pre_crab_list,
     main = "Histogram of Premolt Sizes",
     xlab = "Carapace Sizes",
     border = "black",
     col = "darkorchid1",
     xlim = c(30, 160),
     las = 1,
     breaks = 100)

hist(post_crab_list,
     main = "Histogram of Postmolt Sizes",
     xlab = "Carapace Sizes",
     border = "black",
     col = "darkslateblue",
     xlim = c(30, 180),
     las = 1,
     breaks = 100)

So far the analysis has used the entire lists of premolt and postmolt carapace sizes. The study now focuses on whether the crabs molted in the field or in the lab, and on any statistical differences between the two. To do this, the original data table was split into two: one of crabs that molted in the field and one of crabs that molted in the lab.

field_crabs <- filter(crabsizes, lf == 0)

lab_crabs <- filter(crabsizes, lf == 1)

Each of these two tables was then reduced to two lists, giving four in total: premolt sizes in the field, postmolt sizes in the field, premolt sizes in the lab, and postmolt sizes in the lab.

pre_field_crabs <- pull(field_crabs, var = -5)

post_field_crabs <- pull(field_crabs, var = -4)

pre_lab_crabs <- pull(lab_crabs, var = -5)

post_lab_crabs <- pull(lab_crabs, var = -4)

Now all the descriptive statistics for each of these four lists can be found fairly simply by:

mean_pre_field_crabs <- mean(pre_field_crabs)

median_pre_field_crabs <- median(pre_field_crabs)

SD_pre_field_crabs <- sd(pre_field_crabs)

IQR_pre_field_crabs <- IQR(pre_field_crabs)

min_pre_field_crabs <- min(pre_field_crabs)

max_pre_field_crabs <- max(pre_field_crabs)

mean_post_field_crabs <- mean(post_field_crabs)

median_post_field_crabs <- median(post_field_crabs)

SD_post_field_crabs <- sd(post_field_crabs)

IQR_post_field_crabs <- IQR(post_field_crabs)

min_post_field_crabs <- min(post_field_crabs)

max_post_field_crabs <- max(post_field_crabs)

mean_pre_lab_crabs <- mean(pre_lab_crabs)

median_pre_lab_crabs <- median(pre_lab_crabs)

SD_pre_lab_crabs <- sd(pre_lab_crabs)

IQR_pre_lab_crabs <- IQR(pre_lab_crabs)

min_pre_lab_crabs <- min(pre_lab_crabs)

max_pre_lab_crabs <- max(pre_lab_crabs)

mean_post_lab_crabs <- mean(post_lab_crabs)

median_post_lab_crabs <- median(post_lab_crabs)

SD_post_lab_crabs <- sd(post_lab_crabs)

IQR_post_lab_crabs <- IQR(post_lab_crabs)

min_post_lab_crabs <- min(post_lab_crabs)

max_post_lab_crabs <- max(post_lab_crabs)

z_pre_field_crabs <- (pre_field_crabs - mean(pre_field_crabs))/sd(pre_field_crabs)

z_post_field_crabs <- (post_field_crabs - mean(post_field_crabs))/sd(post_field_crabs)

z_pre_lab_crabs <- (pre_lab_crabs - mean(pre_lab_crabs))/sd(pre_lab_crabs)

z_post_lab_crabs <- (post_lab_crabs - mean(post_lab_crabs))/sd(post_lab_crabs)

skewness_pre_field_crabs <- sum( ((pre_field_crabs - mean(pre_field_crabs))^3) / (length(pre_field_crabs)*(sd(pre_field_crabs) ^ 3)) )

skewness_post_field_crabs <- sum( ((post_field_crabs - mean(post_field_crabs))^3) / (length(post_field_crabs)*(sd(post_field_crabs) ^ 3)) )

skewness_pre_lab_crabs <- sum( ((pre_lab_crabs - mean(pre_lab_crabs))^3) / (length(pre_lab_crabs)*(sd(pre_lab_crabs) ^ 3)) )

skewness_post_lab_crabs <- sum( ((post_lab_crabs - mean(post_lab_crabs))^3) / (length(post_lab_crabs)*(sd(post_lab_crabs) ^ 3)) )

kurtosis_pre_field_crabs <- sum( ((pre_field_crabs - mean(pre_field_crabs))^4) / (length(pre_field_crabs)*(sd(pre_field_crabs) ^ 4)) )

kurtosis_post_field_crabs <- sum( ((post_field_crabs - mean(post_field_crabs))^4) / (length(post_field_crabs)*(sd(post_field_crabs) ^ 4)) )

kurtosis_pre_lab_crabs <- sum( ((pre_lab_crabs - mean(pre_lab_crabs))^4) / (length(pre_lab_crabs)*(sd(pre_lab_crabs) ^ 4)) )

kurtosis_post_lab_crabs <- sum( ((post_lab_crabs - mean(post_lab_crabs))^4) / (length(post_lab_crabs)*(sd(post_lab_crabs) ^ 4)) )
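Because the same statistics are computed for each subset, the field/lab breakdown could also be produced in one pass. The sketch below uses base R's split() and sapply() on a made-up data frame. The column name presz is an assumption (only postsz and lf appear in this report's code), and the numbers are not crab data.

```r
# Made-up data; `presz` is an assumed column name, `postsz` and `lf` follow the report.
toy <- data.frame(presz  = c(100, 110, 120, 130),
                  postsz = c(114, 125, 134, 145),
                  lf     = c(0, 0, 1, 1))

# Split postmolt sizes by the lf indicator, then summarize each group.
# The report's filter() calls treat lf == 0 as field and lf == 1 as lab.
by_group <- split(toy$postsz, toy$lf)
summary_table <- sapply(by_group, function(x) {
  c(mean = mean(x), sd = sd(x), n = length(x))
})
```

The result is a small matrix with one column per value of lf and one row per statistic, which makes the field and lab groups easy to compare side by side.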

The next step was to display this statistical information visually to aid in understanding the results. First, the Normal Curves of all four lists are plotted together so they can be compared easily:

ggplot(data = data.frame(pre_field_crabs = c(75, 200)),
       mapping = aes(x = pre_field_crabs)) +
  stat_function(mapping = aes(colour = "Premolted Carapace Sizes in the Field"),
                fun = dnorm,
                args = list(mean = mean(pre_field_crabs),
                            sd = sd(pre_field_crabs))) +
  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes in the Field"),
                fun = dnorm,
                args = list(mean = mean(post_field_crabs),
                            sd = sd(post_field_crabs))) +
  stat_function(mapping = aes(colour = "Premolted Carapace Sizes in the Lab"),
                fun = dnorm,
                args = list(mean = mean(pre_lab_crabs),
                            sd = sd(pre_lab_crabs))) +
  stat_function(mapping = aes(colour = "Postmolted Carapace Sizes in the Lab"),
                fun = dnorm,
                args = list(mean = mean(post_lab_crabs),
                            sd = sd(post_lab_crabs))) +
  scale_colour_manual(values = c("darkgreen", "red", "chartreuse3", "darkgoldenrod1")) +
  labs(x = "Carapace Sizes",
       y = "Probabilities",
       title = "Normal Distributions of (Premolt Sizes v Postmolt Sizes) v (In the Field v In the Lab)")

Next I looked to create separate histograms for each of these lists:

hist(pre_field_crabs,
     main = "Histogram of Premolt Sizes in the Field",
     xlab = "Carapace Sizes",
     border = "black",
     col = "chartreuse",
     xlim = c(110, 160),
     las = 1,
     breaks = 100)

hist(pre_lab_crabs,
     main = "Histogram of Premolt Sizes in the Lab",
     xlab = "Carapace Sizes",
     border = "black",
     col = "darkgoldenrod1",
     xlim = c(30, 160),
     las = 1,
     breaks = 100)

hist(post_field_crabs,
     main = "Histogram of Postmolt Sizes in the Field",
     xlab = "Carapace Sizes",
     border = "black",
     col = "darkgreen",
     xlim = c(120, 170),
     las = 1,
     breaks = 100)

hist(post_lab_crabs,
     main = "Histogram of Postmolt Sizes in the Lab",
     xlab = "Carapace Sizes",
     border = "black",
     col = "red",
     xlim = c(30, 170),
     las = 1,
     breaks = 100)

Next, a procedure is developed for predicting a crab’s premolt size from its postmolt size. First, let y and x be the means of the full premolt and postmolt lists, respectively. The last quantity computed in this section is the residual standard deviation (SD_r).

y <- mean_pre_crab_list

x <- mean_post_crab_list

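Then the correlation coefficient (r) is calculated, which is used to find the slope (b_hat) of the regression line; a_hat follows from b_hat. Note that r is computed here with a 1/n factor while sd() divides by n - 1, so this r is slightly smaller (by a factor of (n - 1)/n) than the value cor() would return.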

r <- (1/length(post_crab_list)) * sum(((post_crab_list - x) / SD_post_crab_list) * ((pre_crab_list - y) / SD_pre_crab_list))

b_hat <- r * (SD_pre_crab_list / SD_post_crab_list)

a_hat <- b_hat * x - y

SD_r <- sqrt(1 - r^2) * sd(pre_crab_list)

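The regression line prediction is thus: yi = b_hat * xi - a_hat. The minus sign follows from the definition above: a_hat = b_hat * x - y is the negative of the usual intercept.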
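As a cross-check on hand-computed coefficients, R's built-in cor() and lm() can be compared against the formulas above. The sketch below uses made-up vectors rather than the crab data. It keeps this report's convention that a_hat = b_hat * x - y, so a_hat is the negative of the intercept that lm() reports, and it uses cor(), which matches the n - 1 denominator of sd().

```r
post <- c(120, 130, 140, 150, 160)  # made-up postmolt sizes
pre  <- c(105, 116, 126, 138, 147)  # made-up premolt sizes

r     <- cor(post, pre)
b_hat <- r * sd(pre) / sd(post)          # slope
a_hat <- b_hat * mean(post) - mean(pre)  # report's convention: a_hat is subtracted
fit   <- lm(pre ~ post)                  # built-in least-squares fit

# Compare the hand-computed values with the fitted coefficients.
slope_matches     <- isTRUE(all.equal(b_hat, unname(coef(fit)["post"])))
intercept_matches <- isTRUE(all.equal(-a_hat, unname(coef(fit)["(Intercept)"])))
```

If both comparisons are TRUE, the hand calculation agrees with the least-squares fit.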

To test whether this equation is accurate, it was applied to three groups of postmolt sizes: the first between 147.5mm and 152.5mm, the second between 142.5mm and 147.5mm, and the third between 152.5mm and 157.5mm. The first block of code filters the rows into three tables according to these ranges. The second and third blocks pull the postmolt and premolt columns, respectively, from those three tables.

group1_crabsizes <- filter(crabsizes, postsz >= 147.5, postsz <= 152.5)

group2_crabsizes <- filter(crabsizes, postsz >= 142.5, postsz <= 147.5)

group3_crabsizes <- filter(crabsizes, postsz >= 152.5, postsz <= 157.5)

group1_postmolt <- pull(group1_crabsizes, var = -4)

group2_postmolt <- pull(group2_crabsizes, var = -4)

group3_postmolt <- pull(group3_crabsizes, var = -4)

group1_premolt <- pull(group1_crabsizes, var = -5)

group2_premolt <- pull(group2_crabsizes, var = -5)

group3_premolt <- pull(group3_crabsizes, var = -5)

Next, the expected premolt sizes (y_hat#) are calculated so that they can be compared with the observed ones (group#_premolt).

y_hat1 <- b_hat * group1_postmolt - a_hat

y_hat2 <- b_hat * group2_postmolt - a_hat

y_hat3 <- b_hat * group3_postmolt - a_hat

Compare now by finding each test statistic value:

test_statistic1 <- sum((y_hat1 - mean(group1_premolt))^2) / mean(group1_premolt)

test_statistic2 <- sum((y_hat2 - mean(group2_premolt))^2) / mean(group2_premolt)

test_statistic3 <- sum((y_hat3 - mean(group3_premolt))^2) / mean(group3_premolt)
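The three group checks repeat one calculation, which could be wrapped in a function. This is a sketch with a hypothetical helper name (group_statistic) and made-up inputs. It reproduces the statistic as defined above: the squared differences between each prediction and the group's mean observed premolt size are summed, then divided by that mean.

```r
# Hypothetical helper wrapping the y_hat and test-statistic steps for one group.
group_statistic <- function(postmolt, premolt, b_hat, a_hat) {
  y_hat <- b_hat * postmolt - a_hat  # predicted premolt sizes
  sum((y_hat - mean(premolt))^2) / mean(premolt)
}

# Made-up example values, not results from the crab data.
stat <- group_statistic(postmolt = c(148, 150, 152),
                        premolt  = c(133, 136, 139),
                        b_hat = 1, a_hat = 14)
```

With the real data, group_statistic(group1_postmolt, group1_premolt, b_hat, a_hat) would correspond to test_statistic1.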

Results: The table below gives the values of the descriptive statistics described above. It is color coded from green to red (green, green-yellow, yellow, orange, and red): the largest number in each row (descriptive statistic) is green, the smallest is red, and the rest fall in between.

Table (sizes in millimeters; each Z row gives the size at that z-score, mean + z × SD):

Descriptive Statistic	Premolt Sizes	Postmolt Sizes	Premolt Sizes – Field	Postmolt Sizes – Field	Premolt Sizes – Lab	Postmolt Sizes – Lab
Mean	129.21186	143.89767	139.00901	152.96396	126.19945	141.10997
Median	132.8	154	140.1	154	128.9	143.7
Standard Deviation	15.86452	14.64060	7.25115	6.71997	16.56878	15.28078
IQR	18.325	15.45	7.70000	7	18.1	15.6
Minimum	31.1	38.8	113.6	127.7	31.1	38.8
Maximum	155.1	166.8	153.9	166.5	155.1	166.8

Z-3	81.6183	99.97587	117.25556	132.80405	76.49311	95.26763
Z-2	97.48282	114.61647	124.50671	139.52402	93.06189	110.54841
Z-1	113.34734	129.25707	131.75786	146.24399	109.63067	125.82919
Z0	129.21186	143.89767	139.00901	152.96396	126.19945	141.10997
Z1	145.07638	158.53827	146.26016	159.68393	142.76823	156.39075
Z2	160.9409	173.17887	153.51131	166.4039	159.33701	171.67153
Z3	176.80542	187.81947	160.76246	173.12387	175.90579	186.95231
Skewness	-1.99712	-2.33945	-1.09590	-1.10398	-1.88109	-2.27861
Kurtosis	9.72498	13.06052	4.67604	5.14670	8.97447	12.37337

A comparison of the Normal Distributions for all premolt carapace sizes against all postmolt carapace sizes helps to paint a clear visual picture. The postmolt curve is shifted toward larger sizes than the premolt curve, as expected, since crabs grow when they molt.

It is also useful to compare the histograms of both distributions.

The histograms tell the same story. Next, this study looks at whether the carapaces were larger for crabs that molted in the field or in the lab: first the Normal Distributions, then each of the four histograms.

It can be clearly seen that carapace sizes, both premolt and postmolt, tend to be larger for crabs that molted in the field than for crabs that molted in the lab.

The expression for predicting a crab’s premolt size from its postmolt size, derived in the Methods section, is:

yi = b_hat * xi - a_hat

From the calculations made in the program, b_hat = 1.07 and a_hat = 24.89 (a_hat is subtracted, as in the y_hat calculations in the Methods code). Therefore, my regression line prediction expression is yi = 1.07 * xi - 24.89.

For any given postmolt size (xi), multiply it by 1.07 and then subtract 24.89 to obtain a fairly accurate prediction of that crab’s premolt size (yi).
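As a worked example, the snippet below applies the reported coefficients to a hypothetical postmolt size of 150mm. It follows the Methods code, which subtracts a_hat (y_hat1 <- b_hat * group1_postmolt - a_hat). The 150mm input is an arbitrary illustration, and 2.42 is the residual SD reported in this section.

```r
b_hat <- 1.07    # reported slope
a_hat <- 24.89   # reported a_hat (subtracted, as in the Methods code)
sd_r  <- 2.42    # reported residual SD, in mm

postmolt <- 150  # hypothetical postmolt size, in mm
predicted_premolt <- b_hat * postmolt - a_hat  # 1.07 * 150 - 24.89 = 135.61

# Range of one residual SD on either side of the prediction.
one_sd_range <- c(predicted_premolt - sd_r, predicted_premolt + sd_r)
```

So a crab measuring 150mm after molting would be predicted to have measured about 135.6mm before molting, with a one-SD range of roughly 133.2mm to 138.0mm.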

The residual SD (SD_r) was calculated to be about 2.42mm. This means that, if the residuals are roughly normally distributed, about 68% of observed yi-values will fall within about 2.42mm of their predictions, and about 95% will fall within about 4.85mm. Finally, the test statistics comparing the expected and observed premolt sizes are:

test_statistic1 = 2.0489

test_statistic2 = 1.437982

test_statistic3 = 1.608636

Discussion and Conclusion: The test statistic values are relatively low, which suggests that the expected values are close to the observed values and that the fit is appropriate. This study therefore accomplished what it set out to do: it provides numerical and graphical summaries of the data and an expression that predicts, with reasonable accuracy, a crab’s premolt carapace size from its postmolt carapace size alone.
