DNA Splice Junctions I: Cleaning and Preparing the Data

Splice junctions are locations on sequences of DNA or RNA where superfluous sections are removed when proteins are created. After the splice, a section, known as the intron, is removed and the remaining sections, known as the exons, are joined together. Being able to identify these sequences of DNA is useful but time-consuming. This begs the question: Can spliced sections of DNA be determined with machine learning?

In the next two posts we’ll try to do exactly that. To do this, we’re going to use the UCI Splice-junction Gene Sequence dataset. It consists of sequences of DNA that contain either the part of the DNA retained after splicing, the part that was spliced out, or neither. Our problem is to distinguish between these cases.

In this post, I’m going to focus on cleaning and preparing the data set. In the next, I’ll walk through logistic regression to show how it works.

Table of Contents

Getting the Data
Examining the Data
Putting it All Back Together

splice Image from Wikipedia

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from pandas_profiling import ProfileReport

Getting the Data

This dataset was prepared by the UCI Machine Learning Repository, so we can download it from their page and read it using pandas.

df = pd.read_csv(r"datasets/DNA/splice.data",
    names=["Class", "Instance", "Sequence"])

Examining the Data

df.head()

	Class	Instance	Sequence
0	EI	ATRINS-DONOR-521	CCAGCTGCATCACAGGAGGCCAGCGAGCAGG...
1	EI	ATRINS-DONOR-905	AGACCCGCCGGGAGGCGGAGGACCTGCAGGG...
2	EI	BABAPOE-DONOR-30	GAGGTGAAGGACGTCCTTCCCCAGGAGCCGG...
3	EI	BABAPOE-DONOR-867	GGGCTGCGTTGCTGGTCACATTCCTGGCAGGT...
4	EI	BABAPOE-DONOR-2817	GCTCAGCCCCCAGGTCACCCAGGAACTGACGTG...

df.describe()

	Class	Instance	Sequence
count	3190	3190	3190
unique	3	3178	3092
top	N	HUMALBGC-DONOR-17044	CAAATTGTGGACGTGATTCCCTTTCTCAGGGTGAG...
freq	1655	2	3

Looks like we have only three different classes. Almost all the instances and sequences are unique though. Let’s look at the classes.

UPDATE: I recently found an amazing tool called pandas-profiling that makes it fast and easy to learn about the dataset. I thought it was so cool that I wanted to demonstrate it here.

profile = ProfileReport(df, title="Pandas Profiling Report")

Once you generate a profile, there are two ways to display it. One is profile.to_widgets() which works well in a Jupyter Notebook but doesn’t display on this blog, so I’ll skip that for this post. The other way is by using profile.to_notebook_iframe().

profile.to_notebook_iframe()

Render HTML: 100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.49it/s]

This tool does some of the same exploratory data analysis that I show below, but I’m going to keep it in anyway because it’s good to remember how to do it manually.

Class

Let’s see what our three classes are and how many we have of each type.

Y = df['Class']

Y.groupby(Y).count()

Class
EI     767
IE     768
N     1655
Name: Class, dtype: int64

The labels are currently strings of either ‘IE’, ‘EI’, or ‘N’. They’re well-balanced between EI and IE, although most of the samples are neither. To use these labels to train an algorithm, we’ll need to encode them as integers.

le = LabelEncoder()
le.fit(Y)
# record the label encoder mapping
le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
y = le.transform(Y)

Here’s the encoding we’ve done:

le_name_mapping

{'EI': 0, 'IE': 1, 'N': 2}

y.shape

(3190,)

Now our labels are an (N,) shape array of 0, 1, and 2. That’s all we need to do with them. Let’s dig into the instance names.

Instance

Let’s look at some of them and see if we can find a pattern. Let’s look at the first 20 then another random sample of 20.

df['Instance'][:20]

           ATRINS-DONOR-521
           ATRINS-DONOR-905
           BABAPOE-DONOR-30
          BABAPOE-DONOR-867
         BABAPOE-DONOR-2817
         CHPIGECA-DONOR-378
         CHPIGECA-DONOR-903
        CHPIGECA-DONOR-1313
        GCRHBBA1-DONOR-1260
        GCRHBBA1-DONOR-1590
        GCRHBBA6-DONOR-461
        GCRHBBA6-DONOR-795
       GIBHBGGL-DONOR-2278
       GIBHBGGL-DONOR-2624
       GIBHBGGL-DONOR-7198
       GIBHBGGL-DONOR-7544
       HUMA1ATP-DONOR-1972
       HUMA1ATP-DONOR-7932
       HUMA1ATP-DONOR-9653
      HUMA1ATP-DONOR-11057
Name: Instance, dtype: object

df['Instance'].sample(20, random_state=0)

       HUMCP21OHC-ACCEPTOR-1015
              HUMTNFA-DONOR-952
            HUMLYL1B-DONOR-2722
           HUMMHB27B-DONOR-1478
              HUMFOS-DONOR-2005
        HUMTUBAG-ACCEPTOR-2034
               HUMTHR-NEG-1501
               HUMZFY-NEG-2341
            HUMMRP8A-DONOR-1504
               HUMADH2E2-NEG-1
                 HUMASA-NEG-61
         HUMIL1B-ACCEPTOR-1506
        HUMUBILP-ACCEPTOR-1488
               HUMALDH03-NEG-1
             HUMCGPRA-NEG-1981
        HUMALBGC-ACCEPTOR-13672
             HUMSODA-DONOR-3967
           HUMATP1A2-NEG-11221
             HUMFGRINT-NEG-421
            HUMACCYBB-DONOR-2438
Name: Instance, dtype: object

It looks like we have a sequence of letters that often repeat in nearby rows; followed by one of the three words DONOR, ACCEPTOR, and NEG; followed by a number. We could parse the different parts out and use them individually as features. The dashes make this very easy with pandas’ built-in regular expression matching.

df['Instance_Prefix'] = df['Instance'].str.extract("(.*)-(\w*)-(\d*)")[0]
df['Instance_Donor'] = df['Instance'].str.extract("(.*)-(\w*)-(\d*)")[1]
df['Instance_Number'] = df['Instance'].str.extract("(.*)-(\w*)-(\d*)")[2]

df.head()

	Class	Instance	Sequence	Instance_Prefix	Instance_Donor	Instance_Number
0	EI	ATRINS-DONOR-521	CCAGCTGCATCACAGGAGGCCAGCGAGCAGG...	ATRINS	DONOR	521
1	EI	ATRINS-DONOR-905	AGACCCGCCGGGAGGCGGAGGACCTGCAGGG...	ATRINS	DONOR	905
2	EI	BABAPOE-DONOR-30	GAGGTGAAGGACGTCCTTCCCCAGGAGCCGG...	BABAPOE	DONOR	30
3	EI	BABAPOE-DONOR-867	GGGCTGCGTTGCTGGTCACATTCCTGGCAGGT...	BABAPOE	DONOR	867
4	EI	BABAPOE-DONOR-2817	GCTCAGCCCCCAGGTCACCCAGGAACTGACGTG...	BABAPOE	DONOR	2817

Prefix

Let’s see how many unique prefixes we have

len(df['Instance_Prefix'].unique())

That’s a lot. Sometimes with categorical data like this, we would use one-hot encoding, where we make each instance its own column and give the value of 1 if it’s that type and 0 otherwise. But that would result in a large sparse matrix. That’s not necessarily a problem, but we’ll put it aside for now and maybe use it later.

Now let’s look at the donor part.

Donor

df['Instance_Donor'].unique()

array(['DONOR', 'ACCEPTOR', 'NEG'], dtype=object)

OK, just the three we saw before. That’s good.

Let’s encode these in our dataset using one-hot encoding. We could use SKLearn’s labelEncoder and then One Hot, but pandas has something called get_dummies built-in that works with strings.

donor_one_hot_df = pd.get_dummies(df['Instance_Donor'])
donor_one_hot_df.sample(10, random_state=0)

	ACCEPTOR	DONOR	NEG
982	1	0	0
702	0	1	0
461	0	1	0
480	0	1	0
298	0	1	0
1495	1	0	0
3055	0	0	1
3163	0	0	1
543	0	1	0
1615	0	0	1

Now we’ll convert it to a numpy array.

X_donor = np.asarray(donor_one_hot_df)
X_donor.shape

(3190, 3)

Number

I don’t think the Instance_Number is going to help us, so I’m going to ignore it for the moment. Now let’s look at the sequence.

Sequence

Finding unique values

I would assume the Sequence consists of only A, C, T, and G, but we’ll double check just to be sure.

df['Sequence'][0]

'               CCAGCTGCATCACAGGAGGCCAGCGAGCAGGTCTGTTCCAAGGGCCTTCGAGCCAGTCTG'

The first example looks good except for the extra white space. We’ll use .strip() to remove it. Let’s go through the rest.

def find_letters(row):
    return set(row.strip())

Let’s turn the pandas series df['Sequence'] into a pandas series of the set of each row. This will remove duplicate values.

set_series = df['Sequence'].apply(find_letters)

Find the union of the sets to show all the unique values.

set.union(*set_series)

{'A', 'C', 'D', 'G', 'N', 'R', 'S', 'T'}

OK, we have a lot more letters than I expected. Let’s see how common they are and how they’re used.

Exploring Unexpected Values

set_series[set_series.str.contains('D', regex=False)]

1247    {D, C, G, T, A}
2578    {D, C, T, G, A}
Name: Sequence, dtype: object

Only two instances, that’s good. Let’s look at one of them.

df.loc[1247].Sequence.strip()

'DGACGGGGCTGACCGCGGGGGCGGGTCCAGGGTCTCACACCCTCCAGAATATGTATGGCT'

After consulting the original documentation again, it looks like these letters are used when there’s uncertainty in the actual letter:

D: A or G or T
N: A or G or C or T
S: C or G
R: A or G

Let’s look at the other letters as well.

set_series[set_series.str.contains('N', regex=False)]

   {C, N, T, G, A}
   {C, N, G, T, A}
   {C, N, G, T, A}
   {C, N, G, T, A}
   {C, N, G, T, A}
  {C, N, G, T, A}
  {C, N, T, G, A}
  {C, N, G, T, A}
  {C, N, G, T, A}
  {C, N, G, T, A}
  {C, N, G, T, A}
Name: Sequence, dtype: object

df.loc[107].Sequence.strip()

'CACACAGGGCACCCCCTCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN'

set_series[set_series.str.contains('R', regex=False)]

1440    {C, G, T, A, R}
Name: Sequence, dtype: object

df.loc[1440].Sequence.strip()

'ATACCCCTTTTCACTTTCCCCACCTCTTAGGGTARTCAGTACTGGCGCTTTGAGGATGGT'

set_series[set_series.str.contains('S', regex=False)]

1441    {C, G, T, A, S}
Name: Sequence, dtype: object

df.loc[1441].Sequence.strip()

'CCCTCCTAATGCCCACCATCCCGTCCTCAGGGAAASAGTACTGGGAGTACCAGTTCCAGC'

OK, there aren’t too many rows with these missing values. We could remove them, but if there’s enough information in the rest of the sequence to distinguish the class, we would be throwing away useful data. If there’s not, the instance won’t have much effect on the model. So I’m going to keep them in.

Checking the Size

The dataset claims that every row has 60 characters (30 before and 30 after the possible splice. Let’s check to make sure that’s true.

df['Sequence'].str.strip().map(len).unique()

array([60], dtype=int64)

OK, looks good.

Cleaning and Transforming

Now we have the DNA sequences represented by a large string of letters. To use the sequence in machine learning, we’re going to need to remove the whitespace and separate each letter into a distinct place on a list. We’ll also need to convert the letters to integers. We’ll do that now.

letter_mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'D': 4, 'N': 5, 'R': 6, 'S': 7}

def convert_base_to_num(bases):
    return [letter_mapping[letter] for letter in bases.strip()]

df['Sequence_list'] = df['Sequence'].apply(convert_base_to_num)

df['Sequence_list'].head()

  [1, 1, 0, 2, 1, 3, 2, 1, 0, 3, 1, 0, 1, 0, 2, ...
  [0, 2, 0, 1, 1, 1, 2, 1, 1, 2, 2, 2, 0, 2, 2, ...
  [2, 0, 2, 2, 3, 2, 0, 0, 2, 2, 0, 1, 2, 3, 1, ...
  [2, 2, 2, 1, 3, 2, 1, 2, 3, 3, 2, 1, 3, 2, 2, ...
  [2, 1, 3, 1, 0, 2, 1, 1, 1, 1, 1, 0, 2, 2, 3, ...
Name: Sequence_list, dtype: object

Now we’ve converted the letters to integers. We could continue with this and train the model, but that has downsides. Even though the values are integers, the data are still categorical because each number represents a category, not a quantitative or ordinal value. For example, A is 0, C is 1, and G is 2, but this doesn’t mean that A is closer to or more similar to C than it is to G. But our algorithm won’t know that and could be misled, thinking the data are quantitative when they’re not.

To avoid that, we’ll use one-hot encoding.

We have to be careful with data types. We currently have a bunch of integers inside a list within a pandas series. We want to split those lists so that each individual integer is in its own column. To do that, we’ll convert the pandas series into an ndarray.

X_sequence = np.array(df['Sequence_list'].values.tolist())

One-hot encoding our matrix will change the shape of it. Let’s check the shape now so we can compare.

X_sequence.shape

(3190, 60)

encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_sequence)
X_one_hot = X_encoded.toarray()

X_one_hot.shape

(3190, 287)

This array is ready to go. Let’s look at the labels next.

Putting it All Back Together

I was originally going to combine the X_donor array with X_one_hot as the final X-values in the dataset. But after testing some models on it, it looks like the X_donor data too accurately predicts the label. Using just that information I can predict 100% of the labels, so the sequence data becomes meaningless. To me, that means that the Instance data shouldn’t actually be used to predict the labels. So our final dataset will just be the sequence data.

#X = np.append(X_donor, X_sequence, axis=1)
X = X_one_hot

Now we save it off into a csv file to examine in the next notebook.

np.save('dna_cleaner.npy', X)

np.save('dna_clean_labels', y)