There’s an ongoing competition at Kaggle called Titanic: Machine Learning from Disaster, where the goal is to build a model that predicts whether or not a passenger survived, using data like name, age, gender, socio-economic class, etc. To train our model, we’re given details of 891 passengers along with whether or not they survived. This is what the data looks like:
We will then use our model to predict whether or not the other 418 passengers survived. This is what our submission will look like:
Our score in the competition is simply our model’s accuracy (i.e. predicting 50% of the passengers’ outcomes correctly would be a 0.5).
As a baseline for the competition, we’re given a submission file that predicts every woman survives and every man doesn’t survive. Not surprisingly, this does fairly well with an accuracy of 0.76. Let’s see how much we can improve on this with machine learning!
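The baseline can be reproduced in a few lines. This is a hedged sketch: the tiny inline DataFrame stands in for the real test set (which would be loaded from the competition CSV), but the column names match the competition data.

```python
import pandas as pd

# Stand-in for the real test set; in the competition this would be
# pd.read_csv("test.csv"). PassengerId values are illustrative.
test = pd.DataFrame({"PassengerId": [892, 893, 894],
                     "Sex": ["male", "female", "male"]})

# Baseline rule: every woman survives (1), every man doesn't (0)
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),
})
```

On the real test set, this rule alone scores 0.76.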
First, we have to prepare our data. Unlike the datasets we worked with before, this one has plenty of missing and inaccurate values. After all, no one was expecting to create this dataset. As our first step, we can remove outliers. Using a standard IQR test (flagging values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR), we find that the following samples contain outliers.
Three have unusually high fares of 263 and the rest have unusually high numbers of siblings and spouses. We’ll remove all of them.
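The IQR test above can be sketched as follows. The four-row DataFrame is purely illustrative (the real training set has 891 rows); the per-column helper is an assumption about how one might implement the test, not the exact code used.

```python
import pandas as pd

# Illustrative data: one unusually high fare and one unusually
# high sibling/spouse count, mimicking the outliers found above.
df = pd.DataFrame({"Fare": [7.25, 8.05, 13.0, 263.0],
                   "SibSp": [8, 1, 0, 0]})

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Rows where any numeric column is an outlier
mask = df.apply(iqr_outliers)
outlier_rows = df[mask.any(axis=1)]
```

Dropping `outlier_rows.index` from the training set then removes these samples.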
Next, we have to choose which features we want to include in our model. Running isnull().sum() on our data, we can see how much missing data there is across each feature:
```
Test:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

Train:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
```
Age and Cabin have too much missing data to easily deal with. Trying to impute them (fill in the missing values with a placeholder like the mean or median) would probably be worse than leaving them out of our model entirely. Incorporating names and ticket numbers into our model doesn’t seem reasonable either. We’ll move forward with passenger class, sex, number of siblings and spouses, number of parents and children, and ticket fare. The single missing Fare value, though, we will impute. Then, let’s run our data through a random forest model.
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

features = ["Pclass", "Sex", "SibSp", "Parch", "Fare"]

# One-hot encode data (turn categorical features into separate binary features)
X_train = pd.get_dummies(train[features])
y_train = train["Survived"]
X_test = pd.get_dummies(test[features])

# Fill the single missing Fare value with the feature mean
my_imputer = SimpleImputer()
X_train_imputed = pd.DataFrame(my_imputer.fit_transform(X_train))
X_test_imputed = pd.DataFrame(my_imputer.transform(X_test))

# Imputation removed column names; put them back
X_train_imputed.columns = X_train.columns
X_test_imputed.columns = X_test.columns

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train_imputed, y_train)
predictions = model.predict(X_test_imputed)
```
Our submission ends up with a score of 0.77990, a small improvement of less than 2% over our baseline. What else can we do to improve this?