# Molecular Biology: Sequence Analysis

See attached. Please show all work in detail.

--------------------------------------------------------

(a)

Given two sequences x and y as shown below, determine the minimum number of edit operations (substitutions and indels) required to transform one into the other.

________________________________________________________________________

(b)

Determine the Hamming distance between the strings: CENTURY and SANCTUARY

Determine the Levenshtein distance between the strings: BIO-INFORMATICS and TRI-TELEMATICS

_____________________________________________________________________

Binary representation of a DNA sequence: Concept

Sometimes DNA sequence analysis can be done by converting the sequence into binary format. For example, suppose the following dibit representation is used: A ==> (11); C ==> (01); G ==> (10); T ==> (00). Then a sequence such as ACCTGCA can be written as: 11 01 01 00 10 01 11
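As a minimal sketch of the dibit representation described above, the mapping can be written as a small lookup table. The names `DIBITS` and `encode_dibits` are illustrative assumptions, not part of the original problem.

```python
# Dibit code taken from the text: A -> 11, C -> 01, G -> 10, T -> 00.
DIBITS = {"A": "11", "C": "01", "G": "10", "T": "00"}


def encode_dibits(seq: str) -> list[str]:
    """Return the dibit for each base; whitespace between bases is ignored."""
    return [DIBITS[base] for base in seq.upper() if not base.isspace()]


print(" ".join(encode_dibits("ACCTGCA")))  # 11 01 01 00 10 01 11
```

Because whitespace is skipped, a sequence pasted with spaces between the letters encodes the same way as the compact form.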

(c)

By constructing the binary format of the pair of sequences x and y given below, determine the Hamming distance between them.

(Hint: For binary strings a and b, the Hamming distance is equal to the number of ones in the result of a XOR b.)
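The hint can be sketched in a few lines, assuming the two sequences have already been written as equal-length strings of `0` and `1` characters. The function name `hamming_bits` is a hypothetical choice for illustration.

```python
def hamming_bits(a: str, b: str) -> int:
    """Hamming distance of two equal-length bit strings via XOR."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    # XOR sets a bit to 1 exactly where a and b disagree,
    # so counting the ones counts the differing positions.
    return bin(int(a, 2) ^ int(b, 2)).count("1")


print(hamming_bits("1101", "1001"))  # 1
```

Note that this counts differing bits, not differing bases: under the dibit code a single base change contributes one bit (for example A = 11 versus C = 01) or two bits (A = 11 versus T = 00).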

_____________________________________________________________________

(d)

Given a template sequence: CCCAAGGGGTTCCAATG. Identify the underlying mutations and derivatives, namely point mutations, deletions, inversions, transpositions, duplications, and insertions, in the following set of strings that resemble the template:

CCCAAGGGGTTTCAATG

CCCAAGGGGTTTCxxxx

CCGGAACGGTTTC

TTTCCCGGAACGG

TTTCCCGGGGAAGG

TTTCCCGGTTAACTTTGG

TTTCCCGGTTAACTTGG

How will you designate the following sequence in relation to the template?

AAAGGCCAATTGAAACC

_____________________________________________________________________

(e)

Transition mutations are more common than transversion mutations.

Construct a matrix to illustrate this characteristic of the mutations. Assume proportionate percentages to depict each type of mutation.

|   | A | T | G | C |
|---|---|---|---|---|
| A |   |   |   |   |
| T |   |   |   |   |
| G |   |   |   |   |
| C |   |   |   |   |
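As a rough aid for filling in such a matrix, the sketch below labels each cell by mutation type. It relies on the standard definition that a transition swaps two purines (A, G) or two pyrimidines (C, T) and a transversion swaps a purine with a pyrimidine. The function name `substitution_type` is an assumption for illustration, and no percentages are assigned here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(a: str, b: str) -> str:
    """Classify a single-base substitution a -> b (both assumed to be A, C, G or T)."""
    if a == b:
        return "none"
    # Both bases in the same chemical class means a transition.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


bases = "ATGC"  # same row/column order as the matrix above
for a in bases:
    print(a, [substitution_type(a, b) for b in bases])
```

Each base has exactly one transition partner and two transversion partners, which is worth keeping in mind when assigning proportionate percentages to the cells.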

Sequences x and y for parts (a) and (c):

x: T A G C T A T C G G G A A C T G

y: G C T C A C G G T T G G G A C T

#### Solution Preview

Let me know if you have any questions or need more details or explanations.

--------------------------------------

(a)

Given two sequences x and y as shown below, determine the minimum number of edit operations (substitutions and indels) required to transform one into the other.

x: T A G C T A T C G G G A A C T G

y: G C T C A C G G T T G G G A C T

This problem is basically asking for the Levenshtein distance. The Levenshtein distance is defined as the minimum number of edits needed to transform one string into the other (the edits that are allowed are insertion, deletion, and substitution).

I used the calculator here (http://www.miislita.com/searchito/levenshtein-edit-distance.html) to determine that the Levenshtein distance for these two strings is 13 (this is if you cut-and-paste the sequences, including all the spaces).

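If, instead, you use the strings that I believe the question is asking for, with the spaces between the letters removed,

TAGCTATCGGGAACTG

GCTCACGGTTGGGACT

the calculator gives 10 when one space is left after the first string and one before the second. That figure appears to include those two stray spaces, which the calculator counts as characters. For the bare 16-letter strings the distance is 9. One set of nine edits that turns x into y: delete the leading T and A, insert C after GCT, delete the T that follows the next A, insert TT after CGG, substitute G for the first of the two adjacent A's, insert G before the remaining A, and delete the final G.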

This website (http://www.merriampark.com/ld.htm) explains how the matrix used to compute the distance is created.
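For readers who would rather check the value than rely on an online calculator, here is a minimal sketch of the usual dynamic-programming matrix for the Levenshtein distance. It is one common formulation, not necessarily the implementation used by either website, and the function name `levenshtein` is an assumption.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    # d[i][j] = distance between the first i characters of s and the first j of t.
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i  # delete all i characters
    for j in range(len(t) + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[len(s)][len(t)]


print(levenshtein("kitten", "sitting"))  # 3
```

Spaces count as ordinary characters here, just as in the calculator, so pasted sequences should be cleaned first, for example with `x.replace(" ", "")`.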

(b)

Determine the Hamming distance between the strings: CENTURY and SANCTUARY

Determine the Levenshtein distance between the strings: BIO-INFORMATICS and TRI-TELEMATICS

(1) The Hamming distance is similar to the Levenshtein distance, except that insertions and deletions are not allowed. You find it by counting the number of positions at which the characters differ (the number of substitutions). (See http://www.ehow.com/how_5179242_calculate-hamming-distance.html for an explanation.)
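The counting rule above can be sketched as follows. The function name `hamming` is illustrative, and the length check reflects the usual convention that the Hamming distance is only defined for strings of equal length.

```python
def hamming(s: str, t: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance is undefined for strings of unequal length")
    # Compare the strings position by position and count the mismatches.
    return sum(a != b for a, b in zip(s, t))


print(hamming("karolin", "kathrin"))  # 3
```

With this sketch, passing two strings of different lengths raises a `ValueError` rather than returning a number.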

Since CENTURY (7 letters) and SANCTUARY (9 letters) are of different length, the Hamming distance ...