In this notebook, I’m going to look at the basics of cleaning data with Python. I will be using a dataset of people involved in the Enron scandal. I first saw this dataset in the Intro to Machine Learning class at Udacity.

Table of Contents

- Loading the Data
- Exploring the Data
- Cleaning the Data
- Looking for Anomalies in the Data

# Imports
import pickle
from os.path import join

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Loading the Data

The first step is to load the data. It’s saved as a pickle file, a file type created by Python’s pickle module, which efficiently serializes and de-serializes Python objects. Serializing is the process of converting a Python object, like a list or dictionary, into a stream of bytes that can be written to disk.
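
To make serialization concrete, here’s a tiny round-trip example, separate from the Enron workflow, using the pickle module we imported above:

# Serialize a small dict to bytes, then restore it
example = {"name": "example", "values": [1, 2, 3]}
serialized = pickle.dumps(example)   # object -> bytes
restored = pickle.loads(serialized)  # bytes -> object
print(restored == example)           # True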

path = "datasets/Enron/"
file = "final_project_dataset.pkl"
with open(join(path, file), "rb") as f:
    enron_data = pickle.load(f)

Exploring the Data

Now let’s look at the data. First, we’ll see what type it is.

print("The dataset is a", type(enron_data))
The dataset is a <class 'dict'>

OK, we have a dictionary. That means we’ll look up values by their associated keys. Let’s see what keys the dataset contains.

print("There are {} entities in the dataset.\n".format(len(enron_data)))
print(enron_data.keys())
There are 146 entities in the dataset.

dict_keys(['METTS MARK', 'BAXTER JOHN C', 'ELLIOTT STEVEN', 'CORDES WILLIAM R', 'HANNON KEVIN P', 'MORDAUNT KRISTINA M', 'MEYER ROCKFORD G', 'MCMAHON JEFFREY', 'HAEDICKE MARK E', 'PIPER GREGORY F', 'HUMPHREY GENE E', 'NOLES JAMES L', 'BLACHMAN JEREMY M', 'SUNDE MARTIN', 'GIBBS DANA R', 'LOWRY CHARLES P', 'COLWELL WESLEY', 'MULLER MARK S', 'JACKSON CHARLENE R', 'WESTFAHL RICHARD K', 'WALTERS GARETH W', 'WALLS JR ROBERT H', 'KITCHEN LOUISE', 'CHAN RONNIE', 'BELFER ROBERT', 'SHANKMAN JEFFREY A', 'WODRASKA JOHN', 'BERGSIEKER RICHARD P', 'URQUHART JOHN A', 'BIBI PHILIPPE A', 'RIEKER PAULA H', 'WHALEY DAVID A', 'BECK SALLY W', 'HAUG DAVID L', 'ECHOLS JOHN B', 'MENDELSOHN JOHN', 'HICKERSON GARY J', 'CLINE KENNETH W', 'LEWIS RICHARD', 'HAYES ROBERT E', 'KOPPER MICHAEL J', 'LEFF DANIEL P', 'LAVORATO JOHN J', 'BERBERIAN DAVID', 'DETMERING TIMOTHY J', 'WAKEHAM JOHN', 'POWERS WILLIAM', 'GOLD JOSEPH', 'BANNANTINE JAMES M', 'DUNCAN JOHN H', 'SHAPIRO RICHARD S', 'SHERRIFF JOHN R', 'SHELBY REX', 'LEMAISTRE CHARLES', 'DEFFNER JOSEPH M', 'KISHKILL JOSEPH G', 'WHALLEY LAWRENCE G', 'MCCONNELL MICHAEL S', 'PIRO JIM', 'DELAINEY DAVID W', 'SULLIVAN-SHAKLOVITZ COLLEEN', 'WROBEL BRUCE', 'LINDHOLM TOD A', 'MEYER JEROME J', 'LAY KENNETH L', 'BUTTS ROBERT H', 'OLSON CINDY K', 'MCDONALD REBECCA', 'CUMBERLAND MICHAEL S', 'GAHN ROBERT S', 'BADUM JAMES P', 'HERMANN ROBERT J', 'FALLON JAMES B', 'GATHMANN WILLIAM D', 'HORTON STANLEY C', 'BOWEN JR RAYMOND M', 'GILLIS JOHN', 'FITZGERALD JAY L', 'MORAN MICHAEL P', 'REDMOND BRIAN L', 'BAZELIDES PHILIP J', 'BELDEN TIMOTHY N', 'DIMICHELE RICHARD G', 'DURAN WILLIAM D', 'THORN TERENCE H', 'FASTOW ANDREW S', 'FOY JOE', 'CALGER CHRISTOPHER F', 'RICE KENNETH D', 'KAMINSKI WINCENTY J', 'LOCKHART EUGENE E', 'COX DAVID', 'OVERDYKE JR JERE C', 'PEREIRA PAULO V. FERRAZ', 'STABLER FRANK', 'SKILLING JEFFREY K', 'BLAKE JR. NORMAN P', 'SHERRICK JEFFREY B', 'PRENTICE JAMES', 'GRAY RODNEY', 'THE TRAVEL AGENCY IN THE PARK', 'UMANOFF ADAM S', 'KEAN STEVEN J', 'TOTAL', 'FOWLER PEGGY', 'WASAFF GEORGE', 'WHITE JR THOMAS E', 'CHRISTODOULOU DIOMEDES', 'ALLEN PHILLIP K', 'SHARP VICTORIA T', 'JAEDICKE ROBERT', 'WINOKUR JR. HERBERT S', 'BROWN MICHAEL', 'MCCLELLAN GEORGE', 'HUGHES JAMES A', 'REYNOLDS LAWRENCE', 'PICKERING MARK R', 'BHATNAGAR SANJAY', 'CARTER REBECCA C', 'BUCHANAN HAROLD G', 'YEAP SOON', 'MURRAY JULIA H', 'GARLAND C KEVIN', 'DODSON KEITH', 'YEAGER F SCOTT', 'HIRKO JOSEPH', 'DIETRICH JANET R', 'DERRICK JR. JAMES V', 'FREVERT MARK A', 'PAI LOU L', 'HAYSLETT RODERICK J', 'BAY FRANKLIN R', 'MCCARTY DANNY J', 'FUGH JOHN L', 'SCRIMSHAW MATTHEW', 'KOENIG MARK E', 'SAVAGE FRANK', 'IZZO LAWRENCE L', 'TILNEY ELIZABETH A', 'MARTIN AMANDA K', 'BUY RICHARD B', 'GRAMM WENDY L', 'CAUSEY RICHARD A', 'TAYLOR MITCHELL S', 'DONAHUE JR JEFFREY M', 'GLISAN JR BEN F'])

The keys are different people who worked at Enron. I see several names familiar from the Enron scandal, including Kenneth Lay, Jeffrey Skilling, Andrew Fastow, and Cliff Baxter (who’s listed as John C Baxter). There’s also an entity named “The Travel Agency in the Park”. According to footnote j in the original document, this business was co-owned by the sister of Enron’s former Chairman. It may be of interest to investigators, but because it’s not an employee it will mess up the machine learning algorithms, so I will remove it.

Now let’s see what information we have about each person. We’ll start with Ken Lay.

enron_data["LAY KENNETH L"]
{'salary': 1072321,
 'to_messages': 4273,
 'deferral_payments': 202911,
 'total_payments': 103559793,
 'loan_advances': 81525000,
 'bonus': 7000000,
 'email_address': 'kenneth.lay@enron.com',
 'restricted_stock_deferred': 'NaN',
 'deferred_income': -300000,
 'total_stock_value': 49110078,
 'expenses': 99832,
 'from_poi_to_this_person': 123,
 'exercised_stock_options': 34348384,
 'from_messages': 36,
 'other': 10359729,
 'from_this_person_to_poi': 16,
 'poi': True,
 'long_term_incentive': 3600000,
 'shared_receipt_with_poi': 2411,
 'restricted_stock': 14761694,
 'director_fees': 'NaN'}

There are checksum-style fields built into the dataset, like total_payments and total_stock_value. We should be able to recompute these from the other fields to double-check them.
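
As a quick sanity check on a single record, we can add up Ken Lay’s payment components (the same categories the original spreadsheet uses) and compare the result to his total_payments, treating the “NaN” strings as zero. We’ll do this check properly for the whole dataset later on.

# Payment components should sum to total_payments; "NaN" strings count as zero here
payment_fields = ["salary", "bonus", "long_term_incentive", "deferred_income",
                  "deferral_payments", "loan_advances", "other", "expenses", "director_fees"]
lay = enron_data["LAY KENNETH L"]
calculated = sum(0 if lay[f] == "NaN" else lay[f] for f in payment_fields)
print(calculated == lay["total_payments"])  # True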

We can also query for a specific value, like this:

print("Jeff Skilling's total payments were ${:,.0f}.".format(enron_data["SKILLING JEFFREY K"]["total_payments"]))
Jeff Skilling's total payments were $8,682,716.

Before we go any further, we’ll put the data into a pandas DataFrame to make it easier to work with.

# The keys of the dictionary are the people, so we'll want them to be the rows of the dataframe
df = pd.DataFrame.from_dict(enron_data, orient="index")
pd.set_option("display.max_columns", len(df.columns))
df.head()
salary to_messages deferral_payments total_payments loan_advances bonus email_address restricted_stock_deferred deferred_income total_stock_value expenses from_poi_to_this_person exercised_stock_options from_messages other from_this_person_to_poi poi long_term_incentive shared_receipt_with_poi restricted_stock director_fees
METTS MARK 365788 807 NaN 1061827 NaN 600000 mark.metts@enron.com NaN NaN 585062 94299 38 NaN 29 1740 1 False NaN 702 585062 NaN
BAXTER JOHN C 267102 NaN 1295738 5634343 NaN 1200000 NaN NaN -1386055 10623258 11200 NaN 6680544 NaN 2660303 NaN False 1586055 NaN 3942714 NaN
ELLIOTT STEVEN 170941 NaN NaN 211725 NaN 350000 steven.elliott@enron.com NaN -400729 6678735 78552 NaN 4890344 NaN 12961 NaN False NaN NaN 1788391 NaN
CORDES WILLIAM R NaN 764 NaN NaN NaN NaN bill.cordes@enron.com NaN NaN 1038185 NaN 10 651850 12 NaN 0 False NaN 58 386335 NaN
HANNON KEVIN P 243293 1045 NaN 288682 NaN 1500000 kevin.hannon@enron.com NaN -3117011 6391065 34039 32 5538001 32 11350 21 True 1617011 1035 853064 NaN

Sometimes it can be helpful to export the data to a spreadsheet. If you want to do that, you can do it like so:

df.to_csv('final_project_dataset.csv')

We can already see missing values, and it looks like they’re entered as the string “NaN”, which pandas will treat as text rather than recognize as a null value. We can do a quick replacement on those.

df = df.replace("NaN", np.nan)

Pandas makes summarizing the data simple. First, we’ll configure pandas to display floats in plain (non-scientific) notation, with thousands separators and no decimal places.

pd.options.display.float_format = "{:10,.0f}".format
df.describe()
salary to_messages deferral_payments total_payments loan_advances bonus restricted_stock_deferred deferred_income total_stock_value expenses from_poi_to_this_person exercised_stock_options from_messages other from_this_person_to_poi long_term_incentive shared_receipt_with_poi restricted_stock director_fees
count 95 86 39 125 4 82 18 49 126 95 86 102 86 93 86 66 86 110 17
mean 562,194 2,074 1,642,674 5,081,526 41,962,500 2,374,235 166,411 -1,140,475 6,773,957 108,729 65 5,987,054 609 919,065 41 1,470,361 1,176 2,321,741 166,805
std 2,716,369 2,583 5,161,930 29,061,716 47,083,209 10,713,328 4,201,494 4,025,406 38,957,773 533,535 87 31,062,007 1,841 4,589,253 100 5,942,759 1,178 12,518,278 319,891
min 477 57 -102,500 148 400,000 70,000 -7,576,788 -27,992,891 -44,093 148 0 3,285 12 2 0 69,223 2 -2,604,490 3,285
25% 211,816 541 81,573 394,475 1,600,000 431,250 -389,622 -694,862 494,510 22,614 10 527,886 23 1,215 1 281,250 250 254,018 98,784
50% 259,996 1,211 227,449 1,101,393 41,762,500 769,375 -146,975 -159,792 1,102,872 46,950 35 1,310,814 41 52,382 8 442,035 740 451,740 108,579
75% 312,117 2,635 1,002,672 2,093,263 82,125,000 1,200,000 -75,010 -38,346 2,949,847 79,952 72 2,547,724 146 362,096 25 938,672 1,888 1,002,370 113,784
max 26,704,229 15,149 32,083,396 309,886,585 83,925,000 97,343,619 15,456,290 -833 434,509,511 5,235,198 528 311,764,000 14,368 42,667,589 609 48,521,928 5,521 130,322,299 1,398,517

The highest salary was $26 million, and the highest value for total payments was $309 million. The value for total payments seems too high, even for Enron, so we’ll have to look into that.

Cleaning the Data

It looks like we have a mix of numbers, strings, and a single column of booleans. The boolean column, “poi”, indicates whether the person is a “Person of Interest”. This is the column we’ll be trying to predict using the other data. Every person in the DataFrame is marked as either a poi or not. Unfortunately, the same isn’t true of the other columns: from the first five rows we can already tell they have lots of missing data. Let’s look at how bad it is.
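
Before we do, since poi is the label we’ll eventually try to predict, here’s a quick glance at the class balance (output not shown here):

# How many people are flagged as persons of interest vs. not?
print(df["poi"].value_counts())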

# Print the number of missing values
num_missing_values = df.isnull().sum()
print(num_missing_values)
salary                        51
to_messages                   60
deferral_payments            107
total_payments                21
loan_advances                142
bonus                         64
email_address                 35
restricted_stock_deferred    128
deferred_income               97
total_stock_value             20
expenses                      51
from_poi_to_this_person       60
exercised_stock_options       44
from_messages                 60
other                         53
from_this_person_to_poi       60
poi                            0
long_term_incentive           80
shared_receipt_with_poi       60
restricted_stock              36
director_fees                129
dtype: int64

That’s a lot, and it isn’t easy to analyze a dataset with that many missing values. Let’s graph it to see what we’ve got. Remember there are 146 different people in this dataset.

fig, ax = plt.subplots(figsize=(16, 10))
x = np.arange(0, len(num_missing_values))
matplotlib.rcParams.update({"font.size": 18})
plt.xticks(x, (df.columns), rotation="vertical")

# create the bars
bars = plt.bar(x, num_missing_values, align="center", linewidth=0)

ax.set_ylabel("Number of missing values")

# remove the frame of the chart
for spine in plt.gca().spines.values():
    spine.set_visible(False)

# direct label each bar with Y axis values
for bar in bars:
    plt.gca().text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() - 5,
        str(int(bar.get_height())),
        ha="center",
        color="w",
        fontsize=16,
    )

plt.show()

[Figure: bar chart of the number of missing values in each column]

The columns with the most missing values are loan_advances and director_fees. I imagine these are genuinely zero for most employees. Based on which columns have the most missing values, the complete lack of zeros in the dataset, and the way the original spreadsheet is laid out, I think we can say that all NaN values should actually be zero. Let’s make that change.

df = df.fillna(0)
df.head()
salary to_messages deferral_payments total_payments loan_advances bonus email_address restricted_stock_deferred deferred_income total_stock_value expenses from_poi_to_this_person exercised_stock_options from_messages other from_this_person_to_poi poi long_term_incentive shared_receipt_with_poi restricted_stock director_fees
METTS MARK 365,788 807 0 1,061,827 0 600,000 mark.metts@enron.com 0 0 585,062 94,299 38 0 29 1,740 1 False 0 702 585,062 0
BAXTER JOHN C 267,102 0 1,295,738 5,634,343 0 1,200,000 0 0 -1,386,055 10,623,258 11,200 0 6,680,544 0 2,660,303 0 False 1,586,055 0 3,942,714 0
ELLIOTT STEVEN 170,941 0 0 211,725 0 350,000 steven.elliott@enron.com 0 -400,729 6,678,735 78,552 0 4,890,344 0 12,961 0 False 0 0 1,788,391 0
CORDES WILLIAM R 0 764 0 0 0 0 bill.cordes@enron.com 0 0 1,038,185 0 10 651,850 12 0 0 False 0 58 386,335 0
HANNON KEVIN P 243,293 1,045 0 288,682 0 1,500,000 kevin.hannon@enron.com 0 -3,117,011 6,391,065 34,039 32 5,538,001 32 11,350 21 True 1,617,011 1,035 853,064 0

Let’s look at the checksum values to make sure everything checks out. By “checksums”, I’m referring to values that are supposed to be the sum of other values in the table. They can be used to find errors in the data quickly. In this case, the total_payments field is supposed to be the sum of all the payment categories: salary, bonus, long_term_incentive, deferred_income, deferral_payments, loan_advances, other, expenses, and director_fees.

The original spreadsheet divides the information into payments and stock value. We’ll mirror that split here, keeping the email-related features in a third group.

payment_categories = [
    "salary",
    "bonus",
    "long_term_incentive",
    "deferred_income",
    "deferral_payments",
    "loan_advances",
    "other",
    "expenses",
    "director_fees",
    "total_payments",
]
stock_value_categories = [
    "exercised_stock_options",
    "restricted_stock",
    "restricted_stock_deferred",
    "total_stock_value",
]
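
The remaining columns are the email statistics plus the poi label. For completeness, here’s the email group as well; it isn’t used in the checksum comparisons below:

email_categories = [
    "to_messages",
    "from_messages",
    "from_poi_to_this_person",
    "from_this_person_to_poi",
    "shared_receipt_with_poi",
    "email_address",
]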

Now let’s sum together all the payment categories (except total_payments) and compare it to the total payments. It should be the same value. We’ll print out any rows that aren’t the same.

# Look at the instances where the total we calculate is not equal to the total listed on the spreadsheet
df[df[payment_categories[:-1]].sum(axis="columns") != df["total_payments"]][payment_categories]
salary bonus long_term_incentive deferred_income deferral_payments loan_advances other expenses director_fees total_payments
BELFER ROBERT 0 0 0 0 -102,500 0 0 0 3,285 102,500
BHATNAGAR SANJAY 0 0 0 0 0 0 137,864 0 137,864 15,456,290
df[df[stock_value_categories[:-1]].sum(axis="columns") != df["total_stock_value"]][stock_value_categories]
exercised_stock_options restricted_stock restricted_stock_deferred total_stock_value
BELFER ROBERT 3,285 0 44,093 -44,093
BHATNAGAR SANJAY 2,604,490 -2,604,490 15,456,290 0

On the original spreadsheet, 102,500 is listed in Robert Belfer’s deferred income section, not in deferral payments. And 3,285 should be in the expenses column, director fees should be 102,500, and total payments should be 3,285. It looks like everything shifted one column to the right, so we’ll have to move it one to the left to fix it. The opposite happened to Sanjay Bhatnagar, so we’ll have to move his columns to the right to fix them. Let’s do that now.

df.loc["BELFER ROBERT"]
salary                               0
to_messages                          0
deferral_payments             -102,500
total_payments                 102,500
loan_advances                        0
bonus                                0
email_address                        0
restricted_stock_deferred       44,093
deferred_income                      0
total_stock_value              -44,093
expenses                             0
from_poi_to_this_person              0
exercised_stock_options          3,285
from_messages                        0
other                                0
from_this_person_to_poi              0
poi                              False
long_term_incentive                  0
shared_receipt_with_poi              0
restricted_stock                     0
director_fees                    3,285
Name: BELFER ROBERT, dtype: object

Unfortunately, the order of the columns in this dataset is different from the order in the original spreadsheet, so I can’t simply shift every value over by one column. I’ll have to fix each incorrect value by hand.

df.loc[("BELFER ROBERT", "deferral_payments")] = 0
df.loc[("BELFER ROBERT", "total_payments")] = 3285
df.loc[("BELFER ROBERT", "restricted_stock_deferred")] = -44093
df.loc[("BELFER ROBERT", "deferred_income")] = -102500
df.loc[("BELFER ROBERT", "total_stock_value")] = 0
df.loc[("BELFER ROBERT", "expenses")] = 3285
df.loc[("BELFER ROBERT", "exercised_stock_options")] = 0
df.loc[("BELFER ROBERT", "restricted_stock")] = 44093
df.loc[("BELFER ROBERT", "director_fees")] = 102500
df.loc["BHATNAGAR SANJAY"]
salary                                                0
to_messages                                         523
deferral_payments                                     0
total_payments                               15,456,290
loan_advances                                         0
bonus                                                 0
email_address                sanjay.bhatnagar@enron.com
restricted_stock_deferred                    15,456,290
deferred_income                                       0
total_stock_value                                     0
expenses                                              0
from_poi_to_this_person                               0
exercised_stock_options                       2,604,490
from_messages                                        29
other                                           137,864
from_this_person_to_poi                               1
poi                                               False
long_term_incentive                                   0
shared_receipt_with_poi                             463
restricted_stock                             -2,604,490
director_fees                                   137,864
Name: BHATNAGAR SANJAY, dtype: object
df.loc[("BHATNAGAR SANJAY", "total_payments")] = 137864
df.loc[("BHATNAGAR SANJAY", "restricted_stock_deferred")] = -2604490
df.loc[("BHATNAGAR SANJAY", "total_stock_value")] = 15456290
df.loc[("BHATNAGAR SANJAY", "expenses")] = 137864
df.loc[("BHATNAGAR SANJAY", "exercised_stock_options")] = 15456290
df.loc[("BHATNAGAR SANJAY", "other")] = 0
df.loc[("BHATNAGAR SANJAY", "restricted_stock")] = 2604490
df.loc[("BHATNAGAR SANJAY", "director_fees")] = 0

Let’s check to make sure that fixes our problems.

print(df[df[payment_categories[:-1]].sum(axis="columns") != df["total_payments"]][payment_categories])
print(df[df[stock_value_categories[:-1]].sum(axis="columns") != df["total_stock_value"]][stock_value_categories])
Empty DataFrame
Columns: [salary, bonus, long_term_incentive, deferred_income, deferral_payments, loan_advances, other, expenses, director_fees, total_payments]
Index: []
Empty DataFrame
Columns: [exercised_stock_options, restricted_stock, restricted_stock_deferred, total_stock_value]
Index: []

Looks good!

Looking for Anomalies in the Data

Now we’re going to use a couple of functions provided by the Udacity class. They help to get data out of the dictionaries and into a more usable form. Here they are:

def featureFormat(
    dictionary, features, remove_NaN=True, remove_all_zeroes=True, remove_any_zeroes=False, sort_keys=False
):
    """ convert dictionary to numpy array of features
        remove_NaN = True will convert "NaN" string to 0.0
        remove_all_zeroes = True will omit any data points for which
            all the features you seek are 0.0
        remove_any_zeroes = True will omit any data points for which
            any of the features you seek are 0.0
        sort_keys = True sorts keys by alphabetical order. Setting the value as
            a string opens the corresponding pickle file with a preset key
            order (this is used for Python 3 compatibility, and sort_keys
            should be left as False for the course mini-projects).
        NOTE: first feature is assumed to be 'poi' and is not checked for
            removal for zero or missing values.
    """

    return_list = []

    # Key order - first branch is for Python 3 compatibility on mini-projects,
    # second branch is for compatibility on final project.
    if isinstance(sort_keys, str):
        import pickle

        keys = pickle.load(open(sort_keys, "rb"))
    elif sort_keys:
        keys = sorted(dictionary.keys())
    else:
        keys = list(dictionary.keys())

    for key in keys:
        tmp_list = []
        for feature in features:
            try:
                dictionary[key][feature]
            except KeyError:
                print("error: key ", feature, " not present")
                return
            value = dictionary[key][feature]
            if value == "NaN" and remove_NaN:
                value = 0
            tmp_list.append(float(value))

        # Logic for deciding whether or not to add the data point.
        append = True
        # exclude 'poi' class as criteria.
        if features[0] == "poi":
            test_list = tmp_list[1:]
        else:
            test_list = tmp_list
        ### if all features are zero and you want to remove
        ### data points that are all zero, do that here
        if remove_all_zeroes:
            append = False
            for item in test_list:
                if item != 0 and item != "NaN":
                    append = True
                    break
        ### if any features for a given data point are zero
        ### and you want to remove data points with any zeroes,
        ### handle that here
        if remove_any_zeroes:
            if 0 in test_list or "NaN" in test_list:
                append = False
        ### Append the data point if flagged for addition.
        if append:
            return_list.append(np.array(tmp_list))

    return np.array(return_list)


def targetFeatureSplit(data):
    """ 
        given a numpy array like the one returned from
        featureFormat, separate out the first feature
        and put it into its own list (this should be the 
        quantity you want to predict)

        return targets and features as separate lists

        (sklearn can generally handle both lists and numpy arrays as 
        input formats when training/predicting)
    """

    target = []
    features = []
    for item in data:
        target.append(item[0])
        features.append(item[1:])

    return target, features
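
To show how these two functions fit together, here’s a small usage sketch (not part of the original notebook): poi goes first so targetFeatureSplit can peel it off as the label, and the remaining columns become the features.

# Example usage: extract poi as the target along with two financial features
data = featureFormat(enron_data, ["poi", "salary", "bonus"])
labels, finance_features = targetFeatureSplit(data)
print(len(labels), len(finance_features))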

One of the best ways to detect anomalies is to graph the data. Anomalies often stick out in these graphs. Let’s take a look at how salary correlates with bonus. I suspect it will be positive and fairly strong.

### read in data dictionary, convert to numpy array

features = ["salary", "bonus"]
data = featureFormat(enron_data, features)

for point in data:
    salary = point[0]
    bonus = point[1]
    plt.scatter(salary, bonus)

plt.xlabel("salary")
plt.ylabel("bonus")
plt.show()

[Figure: scatter plot of salary vs. bonus, with one point far above the rest]

OK, someone’s bonus and salary are way higher than everyone else’s. That looks suspicious, so let’s take a look at it.

df[df["salary"] > 10000000]
salary to_messages deferral_payments total_payments loan_advances bonus email_address restricted_stock_deferred deferred_income total_stock_value expenses from_poi_to_this_person exercised_stock_options from_messages other from_this_person_to_poi poi long_term_incentive shared_receipt_with_poi restricted_stock director_fees
TOTAL 26,704,229 0 32,083,396 309,886,585 83,925,000 97,343,619 0 -7,576,788 -27,992,891 434,509,511 5,235,198 0 311,764,000 0 42,667,589 0 False 48,521,928 0 130,322,299 1,398,517

Ah, there’s an “employee” named “TOTAL” in the spreadsheet. Having a row that is the total of our other rows will mess up our statistics, so we’ll remove it. We’ll also remove The Travel Agency in the Park that we noticed earlier.

entries_to_delete = ["THE TRAVEL AGENCY IN THE PARK", "TOTAL"]
for entry in entries_to_delete:
    if entry in df.index:
        df = df.drop(entry)
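
A quick sanity check that both rows are gone: we started with 146 entities, so 144 should remain.

print("{} entities remain.".format(len(df)))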

Now let’s look again.

# Plot salary against bonus again, this time straight from the cleaned DataFrame
salary = df["salary"].values / 1000
bonus = df["bonus"].values / 1000
plt.scatter(salary, bonus)

plt.xlabel("Salary (thousands of dollars)")
plt.ylabel("Bonus (thousands of dollars)")
plt.show()

[Figure: scatter plot of salary vs. bonus, in thousands of dollars, after removing the outlier rows]

That looks better.

Now that we’ve cleaned up the data let’s save it as a CSV so we can pick it up and do some analysis on it another time.

clean_file = "clean_df.csv"
df.to_csv(path + clean_file, index_label="name")
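
When it’s time to pick the analysis back up, something like this should read the file back in with the names as the index (a sketch, assuming the same path):

clean_df = pd.read_csv(path + clean_file, index_col="name")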