ConocoPhillips Datathon Challenge

ConocoPhillips posed a challenge for the TAMU Datathon: build a predictor for whether an oil rig will fail, given a set of sensor values (107 columns).

Here is our naive exploration of several classifiers that predict whether a given state of the sensor data would lead to a rig failure.

Authors

Aditya Pethe
Anikait Sharma
George Thayamkery
Jon Waterman

import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from matplotlib import pyplot as plt

# cleans training data
def clean_data(training):
    # change na's to 0
    replace_na = training.replace('na',0)
    return replace_na


def read_file(filename):
    raw_df = pd.read_csv(filename)
    return raw_df

def split_response(raw_df):
    # get target vector
    response = raw_df['target']

    # get df of training data
    training = raw_df.drop(columns="target")
    
    return response, training

# our simplest classifier (logistic regression), trained on the first 70% of the rows
def logistic_regression(training,response):
    X = training.drop(columns="id")[:int(len(training) * 0.7)]
    y = response[:int(len(training) * 0.7)]
    
    model = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr').fit(X, y)
    return model

#Random Forest model
def RandomForest(training,response):
    X = training.drop(columns="id")[:int(len(training) * 0.7)] # train on first 70% of data
    y = response[:int(len(training) * 0.7)]
    RF = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0).fit(X, y)

    return RF

# support vector machine (linear SVM)
def vector_machine(training,response):
    X = training.drop(columns="id")[:int(len(training) * 0.7)] # train on first 70% of data
    Y = response[:int(len(training) * 0.7)]

    SVM = svm.LinearSVC()
    SVM.fit(X,Y)

    # note: this returns the accuracy on the training rows, not the fitted model
    return round(SVM.score(X,Y),4)

def produce_prediction_vector(model, test_data):
    # predict a label for every row of the test set
    result = model.predict(test_data)
    print("length:", len(result))

    # build the submission as CSV text; ids are 1-based row positions
    final_str = "id,target\n"
    for i, r in enumerate(result):
        final_str += str(i+1) + "," + str(r) + "\n"

    # write the submission file (closed automatically by the with block)
    with open("result4.csv", "w+") as f:
        f.write(final_str)

Clean data and make models

We replaced all the "na" sensor readings in the data with zeros and created Logistic Regression, Random Forest, and linear Support Vector Machine models.
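The zero-filling step can be illustrated on a tiny, made-up table. This is only a sketch: the column names `sensor1` and `sensor2` are hypothetical, and it uses `pd.to_numeric` with `errors="coerce"` followed by `fillna(0)`, which is a variant of the string replacement in `clean_data` rather than the same call. The variant also converts the columns to numeric dtypes, so no strings are passed on to the models.

```python
import pandas as pd

# Hypothetical miniature of the sensor table (the real file has 107 columns).
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "sensor1": ["0.5", "na", "2.0"],
    "sensor2": ["na", "7", "na"],
})

# Anything that cannot be parsed as a number (here the string "na")
# becomes NaN, and every NaN is then replaced by 0.
cleaned = raw.apply(pd.to_numeric, errors="coerce").fillna(0)

print(cleaned.dtypes)
print(cleaned)
```

Note that filling with zero treats a missing reading as a real measurement of 0, which may or may not be sensible for a given sensor.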

# read in training set
filename = "equip_failures_training_set.csv"
df = read_file(filename) 
response, training = split_response(df) # for now training contains ID column, but is dropped when training 

# clean training data
training = clean_data(training)

# generate logistic regression model
model = logistic_regression(training,response)

# generate random forest model
rf_model = RandomForest(training,response)

# fit the SVM; vector_machine returns its training accuracy (a float), not a model
svm_model = vector_machine(training,response)

Results

We got the best results when training our models on the first 70% of the given training set and testing them on the remaining 30%. Holding out part of the data like this is a standard way to check for overfitting, because the score is computed on rows the model never saw during training.

We did not spend too much time tweaking things, but the Random Forest model performed the best (especially after training it on only 70% of the training set), scoring 0.9934 on the held-out rows versus 0.9897 for logistic regression.
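As a rough illustration of the hold-out idea, the sketch below uses scikit-learn's `train_test_split` on synthetic data. The notebook itself slices the first 70% of rows by position; the shuffled, stratified split shown here is an alternative, not what we ran, and the synthetic `X` and `y` are assumptions standing in for the sensor table.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sensor data (assumption: not the real data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 1.5).astype(int)

# Shuffled 70/30 split; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Same hyperparameters as the RandomForest function in this notebook.
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
rf.fit(X_train, y_train)

# A large gap between these two numbers would suggest overfitting.
train_acc = rf.score(X_train, y_train)
test_acc = rf.score(X_test, y_test)
print("train accuracy:", train_acc)
print("held-out accuracy:", test_acc)
```

Shuffling matters when the rows are ordered in some way (for example by time or by label); whether that applies to this data set is not something we checked.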

# score LR on the held-out 30% of the training set (it was fit on the first 70%)
print("testing LR against training:",model.score(training.drop(columns="id")[int(len(training) * 0.7):], response[int(len(training) * 0.7):]))

# score RF on the held-out 30% of the training set (best model we came up with!)
print("testing RF against training:",rf_model.score(training.drop(columns="id")[int(len(training) * 0.7):], response[int(len(training) * 0.7):]))

# NOTE: this line scores rf_model again, so the value printed under the SVM label is the RF score;
# svm_model holds a training accuracy (a float), not a fitted model, so it cannot be scored here
print("testing SVM against training:",rf_model.score(training.drop(columns="id")[int(len(training) * 0.7):], response[int(len(training) * 0.7):]))

testing LR against training: 0.9896666666666667
testing RF against training: 0.9933888888888889
testing SVM against training: 0.9933888888888889
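The SVM and RF lines above print the same value because the SVM line calls `rf_model.score`, and `vector_machine` returns a training accuracy rather than a fitted model. A minimal sketch of how the SVM could be scored on the held-out rows follows. It runs on synthetic data, `fit_linear_svm` is a hypothetical helper that does not exist in this notebook, and the number it prints says nothing about the real data set.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-in for the sensor data (assumption: not the real data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)

# Same positional split as the notebook: first 70% to fit, last 30% to score.
cut = int(len(X) * 0.7)


def fit_linear_svm(X_train, y_train):
    """Hypothetical helper: return the fitted estimator instead of a score."""
    return LinearSVC(random_state=0, max_iter=10000).fit(X_train, y_train)


svm_clf = fit_linear_svm(X[:cut], y[:cut])

# Accuracy on rows the SVM never saw during fitting.
svm_holdout_acc = svm_clf.score(X[cut:], y[cut:])
print("testing SVM against held-out rows:", svm_holdout_acc)
```

Returning the estimator keeps the function consistent with `logistic_regression` and `RandomForest`, which both return models.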

Using the best model to predict the test set

In the end, we placed 38/70 (accuracy: 0.992) on the Kaggle leaderboard for this challenge using the random forest model, which is certainly higher than we expected.

# the RF model is the best, so let's run it on the test dataset and upload the resulting vector to Kaggle

# test RF model on test file and put resulting vector into csv
filename = "equip_failures_test_set.csv"

test_df = read_file(filename)
test_df = clean_data(test_df)

# run model through test dataframe and produce csv
produce_prediction_vector(rf_model, test_df.drop(columns="id"))
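For comparison, the same two-column submission layout (`id,target`) can be produced with pandas instead of string concatenation. This is a sketch only: the `predictions` array is made up, and, like `produce_prediction_vector`, it assumes the ids are simply the 1-based row positions of the test file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical model output for three test rows.
predictions = np.array([0, 1, 0])

# Assumption: ids are 1-based row positions, as in produce_prediction_vector.
submission = pd.DataFrame({
    "id": np.arange(1, len(predictions) + 1),
    "target": predictions,
})

# Write to an in-memory buffer here; passing a filename to to_csv
# would write the submission to disk instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

If the test file's own `id` column is not a plain 1..n sequence, using that column directly would be the safer choice.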