Statistical Report 4

Payton McCarthy

Professor Davis

MTH 332

5 May, 2020

The Search for the Origin of the Replication of Palindromes within CMV

 

Abstract:  This study examines the DNA of the human cytomegalovirus (CMV), a virus that can cause potentially life-threatening disease, for clusters of palindromes.  The goal is to locate the origin of replication within the CMV DNA, since an unusual cluster of palindromes may mark that site.  The data come from M.S. Chee, who published the DNA sequence of CMV in 1990, and from M.Y. Leung, who implemented search algorithms in a computer program to screen the DNA sequence for many types of patterns.

 

Intro and Background:  DNA is like a coded message written in a four-letter alphabet: A, C, T, and G.  The order in which these four letters appear is important, since patterns or repetitions of letters within the DNA can be significant, for example by marking where the origin of replication is.  The pattern of interest here is the complementary palindrome.  The letter pairs A-T and C-G are complementary to each other.  A complementary palindrome is a sequence of letters that, when read in reverse, is the ‘complement’ of the forward sequence.  AT and TA are both complementary palindromes, though very short ones.  CATG, GTAC, ACGT, and TGCA are also complementary palindromes.  In this data set, 296 palindromes at least 10 letters long were found, the longest being 18 letters long.  The entire CMV DNA molecule contains 229,354 complementary pairs of letters, or base pairs, so it is a very long sequence to search.
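The definition of a complementary palindrome can be checked with a short R sketch.  This is only an illustration: the function names complement, reverse_string, and is_complementary_palindrome are my own and are not part of the original analysis, and AAGT is included as an example of a sequence that is not a complementary palindrome.

```r
# Complement each base: A <-> T and C <-> G
complement <- function(s) chartr("ACGT", "TGCA", s)

# Reverse the order of the letters in a single string
reverse_string <- function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")

# A sequence is a complementary palindrome when reversing its
# complement gives back the original sequence
is_complementary_palindrome <- function(s) {
  identical(reverse_string(complement(s)), s)
}

examples <- c("AT", "TA", "CATG", "GTAC", "ACGT", "TGCA", "AAGT")
results <- sapply(examples, is_complementary_palindrome)
print(results)
```

The first six sequences are the examples named above; the last one fails the check because its reversed complement is ACTT.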

 

Methods:  I had trouble getting started on this assignment and went to your office for help.  We figured it out eventually, but Spring Break came right afterwards, followed by the nationwide quarantine.  It has been a struggle to keep up with everything on my plate since all of these drastic changes, but I will share what I was able to record.

To load the data, I used the code

data <- read.table("hcmv.txt", header = FALSE, sep = ",")

data <- as.data.frame(t(data))

to label the dataset as ‘data’.  I then defined L, a vector holding the number of palindromes in each of 57 nonoverlapping intervals of 4,000 base pairs:

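L <- c()

k <- 0

# k = 0, ..., 56 indexes 57 intervals of 4000 base pairs each

while (k <= 56) {

  # count locations inside interval k (both bounds must hold, so use &)

  L <- c(L, sum(data >= k * 4000 + 1 & data <= (k + 1) * 4000))

  k <- k + 1

}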
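The same interval counting can be sketched without a loop.  The locations below are made up for illustration and are not the CMV data, and the names locations, width, n_intervals, and interval_counts are assumptions of this sketch.  It also assumes every location falls within the intervals being counted.

```r
# Hypothetical palindrome locations (illustration only, not the CMV data)
locations <- c(177, 1321, 3999, 4000, 4001, 7999, 8000, 8500, 11999, 12001)

width <- 4000       # interval length in base pairs
n_intervals <- 4    # covers positions 1 to 16000

# Positions 1..4000 map to interval 1, 4001..8000 to interval 2, and so on
idx <- ceiling(locations / width)

# Number of locations in each interval, including intervals with zero hits
interval_counts <- tabulate(idx, nbins = n_intervals)
print(interval_counts)

# Same counts from an explicit loop that requires both bounds to hold
loop_counts <- sapply(0:(n_intervals - 1), function(k) {
  sum(locations >= k * width + 1 & locations <= (k + 1) * width)
})
print(identical(as.integer(loop_counts), interval_counts))
```

The second half repeats the count with the two-sided condition joined by &, which is what makes each location count toward exactly one interval.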

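I then defined ‘counts’, a table of how many intervals contain each palindrome count, and ‘lambda’, the average number of palindromes per interval (294 palindromes over 57 intervals), as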

counts <- as.data.frame(table(L))

lambda <- (294/57)

Next, I defined a function for the cumulative Poisson distribution, P(X <= j) = sum over i = 0, ..., j of exp(-lambda) * lambda^i / i!, as:

PoissonDistribution <- function(j) {

  # Poisson probabilities P(X = i) for i = 0, ..., j

  i <- 0:j

  PDvalues <- lambda ^ i / factorial(i) * exp(-lambda)

  # cumulative probability P(X <= j)

  return(sum(PDvalues))

}
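One way to check a hand-written cumulative Poisson function is to compare it with R's built-in ppois.  The sketch below assumes the same rate as above, lambda = 294/57; the helper name poisson_cdf and the range of j values are my own choices for illustration.

```r
# Assumed rate, taken from the definition of lambda above
lambda <- 294 / 57

# Cumulative Poisson probability P(X <= j), written out term by term
poisson_cdf <- function(j, rate) {
  i <- 0:j
  sum(exp(-rate) * rate^i / factorial(i))
}

j_values <- 0:12
by_hand  <- sapply(j_values, poisson_cdf, rate = lambda)
built_in <- ppois(j_values, lambda)

# The two columns should agree up to floating-point error
print(data.frame(j = j_values, by_hand = by_hand, built_in = built_in))
print(all.equal(by_hand, built_in))
```

Passing the rate as an argument, rather than reading a global lambda, makes the function easier to test with other values.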

Unfortunately, this was all I was able to put together.

Discussion and Conclusion:  Since I was not able to fully complete this assignment, I do not have many conclusive results to share.