For those learning Feature Engineering
FEATURE ENGINEERING DOESN’T HAVE TO BE COMPLICATED
In 2022, I reached out to my mentor and told him where I was currently in my Machine Learning Journey and asked him for help. He asked me a question and that question birthed this project:
What of feature engineering?
I said
I don’t know how to perform feature engineering.
3 months later, I sent this message:
So I took your advice and I’m doing a feature engineering course on Kaggle.
The only problem was that the project at the end of that Kaggle Course was too complex for me, and I felt out of place and stupid.
So, what did I do?
I revisited the Titanic Challenge and used it to learn a few feature engineering techniques that I had learnt in the course.
My first ever Machine Learning Project was the Kaggle Titanic Competition. I cried, I laughed and there were times when I wanted to scream BUT I DID NOT GIVE UP.
I pulled through and I learnt SO MUCH about Data Analysis and Feature Engineering from that project.
My first-ever submission outside of the tutorial did well but was very basic.
I’m going to take you on a step-by-step journey of my latest submission and the reason behind certain decisions.
My Notebook:
Olomo’s Titanic Data Analysis Feature Engineering | Kaggle
My entire notebook is divided into 8 phases:
- IMPORT NECESSARY MODULES
- LOAD DATASET
- DATA CLEANING
- EXPLORATORY DATA ANALYSIS
- OUTLIER DETECTION
- FEATURE ENGINEERING
- FEATURE SELECTION
- MAKE PREDICTIONS
IMPORT NECESSARY MODULES
Before you start doing anything, you need to first import all the modules you deem necessary for your ML notebook. As you progress through your notebook, you will add new modules and delete redundant ones.
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
!pip install sklearn_evaluation
from sklearn_evaluation import plot
LOAD DATASET
Here, I loaded the dataset into a data frame and looked at the first few rows
df = pd.read_csv('/kaggle/input/titanic/train.csv')
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
There are 12 columns.
Let’s copy the dataframe to a new one so we can make changes to it.
data = df.copy()
data.pop("PassengerId")
data.head()
DATA CLEANING
Looking at the dataset above, you observe a few things:
- There are missing values under the Cabin column
- Both the Cabin and Ticket columns have alphanumeric values
We’ll start by taking a look at the Cabin column
CABIN
Let’s take a look at its first 13 values
data["Cabin"][:13]
0 NaN
1 C85
2 NaN
3 C123
4 NaN
5 NaN
6 E46
7 NaN
8 NaN
9 NaN
10 G6
11 C103
12 NaN
Name: Cabin, dtype: object
How many unique values does this column have?
data["Cabin"].nunique()
147
There are too many unique values. Let’s see if we can just use the letters and disregard the numbers.
def get_cabin(v):
    # Keep only the deck letter (the first character) of each cabin value.
    # NaN entries are floats, so the string concatenation below fails and they are skipped.
    n = 0
    for i in v:
        try:
            i + "2"
        except TypeError:
            n += 1
        else:
            v[n] = i[:1]
            n += 1
    return v
get_cabin(data["Cabin"])
data["Cabin"][:13]
0 NaN
1 C
2 NaN
3 C
4 NaN
5 NaN
6 E
7 NaN
8 NaN
9 NaN
10 G
11 C
12 NaN
Name: Cabin, dtype: object
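As an aside, pandas’ string accessor could do the same thing in one line; a minimal sketch (not the code from my notebook), which also leaves the NaN entries untouched:
# Keep only the first character (the deck letter) of each cabin value; NaN stays NaN
data["Cabin"] = data["Cabin"].str[:1]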
Now that we have only the letters saved, let’s take a look at the number of unique values.
print("The number of Unique values in Cabin Column: \n", data["Cabin"].nunique())
print("The Unique values in Cabin Column: \n", data["Cabin"].unique())
The number of Unique values in Cabin Column:
8
The Unique values in Cabin Column:
[nan 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']
There are 8 different types of cabins [excluding NaN]. This seems reasonable.
Let’s take a look at the Ticket column.
TICKET
Let’s take a look at its first 13 values
data["Ticket"][:13]
0 A/5 21171
1 PC 17599
2 STON/O2. 3101282
3 113803
4 373450
5 330877
6 17463
7 349909
8 347742
9 237736
10 PP 9549
11 113783
12 A/5. 2151
Since there are some rows with only numbers, let’s see if we can just use the numbers and disregard the letters.
How many unique values does this column have?
data["Ticket"].nunique()
681
There are a lot of unique values
Let’s save just the numbers in a new column called “Ticket_num”
g = []
for i in data["Ticket"]:
    if " " in i:
        # Append the last element of the split. This ensures that even for tickets
        # with multiple spaces, it is the ticket number that gets saved.
        g.append(i.split(" ")[-1])
    else:
        g.append(i)
data["Ticket_num"] = g
data["Ticket_num"][:13]
0 21171
1 17599
2 3101282
3 113803
4 373450
5 330877
6 17463
7 349909
8 347742
9 237736
10 9549
11 113783
12 2151
Name: Ticket_num, dtype: object
How many unique values are in this column?
data["Ticket_num"].nunique()
679
The number of unique values dropped from 681 to 679, which isn’t much of a reduction. There are still too many unique values, so I will be dropping this column and the Ticket column.
data.drop(["Ticket_num","Ticket"], axis = 1)
Now that that is out of the way, let’s replace the missing values.
Replace missing values
How many missing values are in each column?
for i in data.columns:
    print(i, sum(data[i].isnull()))
data.shape
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
Ticket_num 0
(891, 12)
I replaced the missing values with:
- mean for Continuous Data,
- mode for Categorical Data or
- a new category if the number of missing values was large and the column contained categorical data. [This was done to observe patterns within the missing values.]
def replace_mean(df, column):
    # The mean of the column, rounded to 2 decimal places.
    # Note: the mean is always computed from the training dataframe `data`,
    # so the same value gets reused when cleaning the test set.
    mean = round((data[column].mean(axis=0, skipna=True)), 2)
    # Replace missing values with the mean
    df[column] = df[column].replace(np.nan, mean)

def replace_mode(df, column, value):
    if value:
        # Replace missing values with the supplied value
        df[column] = df[column].replace(np.nan, value)
    else:
        # The mode is also computed from the training dataframe `data`
        mode = (data[column].mode())[0]
        # Replace missing values with the mode
        df[column] = df[column].replace(np.nan, mode)
# Clean missing values in "Age"
replace_mean(data,"Age")
# Replace missing values in "Cabin" with "H" a new Cabin alphabet.
# I did this because of the huge amount of missing data in the column
replace_mode(data, "Cabin", "H")
# Replace the missing values with the column's mode, which in this case is "S"
# I did this because there are just 2 missing values
replace_mode(data, "Embarked", False)
Look at the unique values of the Cabin Column
data["Cabin"].unique()
array(['H', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)
You can see there is a new type of Cabin, “H”
How many missing values are now in each column?
for i in data.columns:
    print(i, sum(data[i].isnull()))
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
Ticket_num 0
EXPLORATORY DATA ANALYSIS
This step is the most important of all. I performed data analysis on each column:
- For individual columns with categorical data, I compared the number of people who survived in each category against the total number of people in that category [dead or alive], then calculated the survival rate of each category and saved it in a dataframe.
For example, I looked at the number of people that survived in each cabin, from Cabin A to Cabin T.
get_cabin(data["Cabin"])
# The number of people in each Cabin
print(data.groupby('Cabin').count()['Survived'].reset_index())
# The number of people that survived in each Cabin
print(data.groupby('Cabin').sum()['Survived'].reset_index())
The number of people in each Cabin
Cabin Survived
0 A 15
1 B 47
2 C 59
3 D 33
4 E 32
5 F 13
6 G 4
7 T 1
The number of people that survived in each Cabin
Cabin Survived
0 A 7
1 B 35
2 C 35
3 D 25
4 E 24
5 F 8
6 G 2
7 T 0
cabin_survival_rate = data.groupby("Cabin").mean()["Survived"]
cabin_survival_rate
#Survival of Cabin A =
#(Number of People who survived in Cabin A)/(Total number of people in Cabin A)
#The same for other Cabins
Cabin
A 0.466667
B 0.744681
C 0.593220
D 0.757576
E 0.750000
F 0.615385
G 0.500000
H 0.299854
T 0.000000
Name: Survived, dtype: float64
I did this for the following Columns:
- Cabin
- Pclass
- Parch
- SibSp
- Embarked
- For Individual Columns with Numerical Data:
AGE
I grouped the numbers into bins/brackets to ease my analysis, and THEN compared the number of people who survived in each bracket against the total number of people in that bracket [dead or alive], calculated the survival rate of each bracket, and saved it in a dataframe.
Let’s check which ages appear in the dataset and what the column’s datatype is.
# Check how many ages are in the data
print(data["Age"].unique(),'\n')
print(data["Age"].dtype)
array([22. , 38. , 26. , 35. , 29.7 , 54. , 2. , 27. , 14. ,
4. , 58. , 20. , 39. , 55. , 31. , 34. , 15. , 28. ,
8. , 19. , 40. , 66. , 42. , 21. , 18. , 3. , 7. ,
49. , 29. , 65. , 28.5 , 5. , 11. , 45. , 17. , 32. ,
16. , 25. , 0.83, 30. , 33. , 23. , 24. , 46. , 59. ,
71. , 37. , 47. , 14.5 , 70.5 , 32.5 , 12. , 9. , 36.5 ,
51. , 55.5 , 40.5 , 44. , 1. , 61. , 56. , 50. , 36. ,
45.5 , 20.5 , 62. , 41. , 52. , 63. , 23.5 , 0.92, 43. ,
60. , 10. , 64. , 13. , 48. , 0.75, 53. , 57. , 80. ,
70. , 24.5 , 6. , 0.67, 30.5 , 0.42, 34.5 , 74. ])
float64
What is the distribution of the ages?
sns.catplot(x= "Survived", y = "Age", data=data, height =10, aspect =2)
This plot is too scattered. Let’s group the ages by creating bins
# Let's group the ages together and then calculate the survival rate
bins = pd.IntervalIndex.from_tuples([(-np.inf, 1),(1, 5),(5, 16), (16, 27), (27, 39), (39, 49), (49, 69), (69, np.inf)])
bins
IntervalIndex([(-inf, 1.0], (1.0, 5.0], (5.0, 16.0], (16.0, 27.0], (27.0, 39.0], (39.0, 49.0], (49.0, 69.0], (69.0, inf]],
closed='right',
dtype='interval[float64]')
Let’s create a new feature that places each age in its appropriate bin (bracket).
Age Column:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
886 27.0
887 19.0
888 29.7
889 26.0
890 32.0
Name: Age, Length: 891, dtype: float64
data['age_bracket'] = pd.cut(data['Age'], bins)
data['age_bracket']
0 (16.0, 27.0]
1 (27.0, 39.0]
2 (16.0, 27.0]
3 (27.0, 39.0]
4 (27.0, 39.0]
...
886 (16.0, 27.0]
887 (16.0, 27.0]
888 (27.0, 39.0]
889 (16.0, 27.0]
890 (27.0, 39.0]
Name: age_bracket, Length: 891, dtype: category
Categories (8, interval[float64]): [(-inf, 1.0] < (1.0, 5.0] < (5.0, 16.0] < (16.0, 27.0] < (27.0, 39.0] < (39.0, 49.0] < (49.0, 69.0] < (69.0, inf]]
How many people survived in each age bracket?
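The two tables below come from the same groupby pattern used earlier for the Cabin column; a sketch of what that code looks like:
# The number of people in each age bracket
print("The number of people from each age bracket")
print(data.groupby('age_bracket').count()['Survived'].reset_index(), "\n")
# The number of people that survived in each age bracket
print("The number of people that survived in each age bracket")
print(data.groupby('age_bracket').sum()['Survived'].reset_index())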
The number of people from each age bracket
age_bracket Survived
0 (-inf, 1.0] 14
1 (1.0, 5.0] 30
2 (5.0, 16.0] 56
3 (16.0, 27.0] 237
4 (27.0, 39.0] 391
5 (39.0, 49.0] 89
6 (49.0, 69.0] 67
7 (69.0, inf] 7
The number of people that survived in each age bracket
age_bracket Survived
0 (-inf, 1.0] 12
1 (1.0, 5.0] 19
2 (5.0, 16.0] 24
3 (16.0, 27.0] 86
4 (27.0, 39.0] 140
5 (39.0, 49.0] 34
6 (49.0, 69.0] 26
7 (69.0, inf] 1
Now let’s calculate the survival rate of each category and save it in a dataframe.
age_survival_rate = data.groupby("age_bracket").mean()["Survived"]
age_survival_rate
age_bracket
(-inf, 1.0] 0.857143
(1.0, 5.0] 0.633333
(5.0, 16.0] 0.428571
(16.0, 27.0] 0.362869
(27.0, 39.0] 0.358056
(39.0, 49.0] 0.382022
(49.0, 69.0] 0.388060
(69.0, inf] 0.142857
Name: Survived, dtype: float64
FARE
Is there any relationship between a passenger’s Fare and their Survival?
sns.catplot(x="Survived", y="Fare", data=data, kind="boxen");
Based on the spread of the boxes, it looks like passengers who paid higher fares were more likely to survive than those who paid lower fares.
- For data analysis between a categorical column and a numeric column, seaborn.catplot works well:
sns.catplot(x="Pclass", y="Fare", data=data, kind="boxen");
As expected, the fare decreases as the ticket class goes from 1st to 3rd Class.
I did this analysis for
- PCLASS & FARE
- AGE_BRACKET & FARE
- CABIN & FARE
- EMBARKED & FARE
- Lastly, I performed data analysis on each gender
WOMEN
print("%d women were onboard"% data[data['Sex'] == 'female'].count()['Survived'])
print("%d women survived"% data[data['Sex'] == 'female'].sum()['Survived'])
314 women were onboard
233 women survived
WOMEN & PCLASS
Is there a relationship between the women who survived and their Pclass?
# The number of Women from each Pclass
cc_w = data[data['Sex'] == "female"]
print("The number of Women from each Pclass")
print(cc_w.groupby('Pclass').count()['Survived'].reset_index(),"\n")
# The number of Women that survived in each Pclass
print("The number of Women that survived in each Pclass")
print(cc_w.groupby('Pclass').sum()['Survived'].reset_index())
The number of Women from each Pclass
Pclass Survived
0 1 94
1 2 76
2 3 144
The number of Women that survived in each Pclass
Pclass Survived
0 1 91
1 2 70
2 3 72
It is very obvious that there is. Only 3 women in 1st Class died. JUST 3. While half of the women from 3rd Class died. That is a HUGE difference.
MEN
print("%d men were onboard"% data[data['Sex'] == 'male'].count()['Survived'])
print("%d men survived"% data[data['Sex'] == 'male'].sum()['Survived'])
577 men were onboard
109 men survived
MEN & PCLASS
Is there a relationship between the men who survived and their Pclass?
# The number of Men from each Pclass
cc_m = data[data['Sex'] == "male"]
print("The number of Men from each Pclass")
print(cc_m.groupby('Pclass').count()['Survived'].reset_index(),"\n")
# The number of Men that survived in each Pclass
print("The number of Men that survived in each Pclass")
print(cc_m.groupby('Pclass').sum()['Survived'].reset_index())
The number of Men from each Pclass
Pclass Survived
0 1 122
1 2 108
2 3 347
The number of Men that survived in each Pclass
Pclass Survived
0 1 45
1 2 17
2 3 47
Approximately 37% of men in 1st Class survived; that is WAAAY higher than the roughly 14% that survived in 3rd Class.
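These per-sex survival rates by Pclass come back later in the FEATURE ENGINEERING phase (the TOTAL PCLASS feature) as women_pclass_survival_rate and men_pclass_survival_rate. Their definitions aren’t shown in this post, but a sketch consistent with the cabin survival rate above would be:
# Survival rate of women in each Pclass (survivors / total women in that class)
women_pclass_survival_rate = cc_w.groupby("Pclass").mean()["Survived"]
# Survival rate of men in each Pclass (survivors / total men in that class)
men_pclass_survival_rate = cc_m.groupby("Pclass").mean()["Survived"]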
OUTLIER DETECTION
Here, we look at data points or rows that seem to be isolated from the majority of the data.
Looking at the plot showing the relationship between Cabin and Fare, you notice a few data points with extremely high fares; they are isolated from the rest of the group. Let’s explore them and then remove them.
data[data["Fare"] > 500]
Let’s remove them
index = data[data["Fare"] > 500].index
data.drop(index,inplace = True)
Let’s check if the rows have been dropped
data[data["Fare"] > 50
They have.
FEATURE ENGINEERING
Here we will create new features based on relationships observed during “Exploratory Data Analysis”.
- To make use of the “Name” Feature, I changed the text data to numerical data by saving the length of the names.
data["Name"] = [len(i) for i in data["Name"]]
data["Name"]
0 23
1 51
2 22
3 44
4 24
..
886 21
887 28
888 40
889 21
890 19
Name: Name, Length: 891, dtype: int64
I wrote a general function to help me attach a column’s survival rates to each row and save the result in a new column. I called it:
The Survival Rate Function
def c_e_survival(df, c_e, c_e_survival_rate, e_survival_rate):
    # df = dataframe
    # c_e = column name
    # c_e_survival_rate = name of the new column
    # e_survival_rate = Series holding the survival rate (or average) of each category
    # Create a new column in the "df" dataframe:
    # for each person, store the value that belongs to their category
    df[c_e_survival_rate] = [e_survival_rate[i] for i in df[c_e]]
    return df[c_e_survival_rate]
CABIN
c_e_survival(data, "Cabin", "Cabin_survival_rate", Cabin_survival_rate)
0 0.299854
1 0.593220
2 0.299854
3 0.593220
4 0.299854
...
886 0.299854
887 0.744681
888 0.299854
889 0.593220
890 0.299854
Name: cabin_survival_rate, Length: 891, dtype: float64
I did this for the following columns (a sketch for one of them follows the list):
- PARCH
- PCLASS
- SIBSP
- EMBARKED
- AGE
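For instance, for Embarked the steps mirror the Cabin example; a sketch, assuming the notebook follows the same pattern:
# Survival rate of each embarkation port, computed from the training data
embarked_survival_rate = data.groupby("Embarked").mean()["Survived"]
# Attach each passenger's rate in a new column
c_e_survival(data, "Embarked", "embarked_survival_rate", embarked_survival_rate)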
- For feature engineering between a categorical column and a numeric column, calculating the per-category average works well.
AVERAGE FARE PER PCLASS
#Calculate the average fare of each Pclass
averagefare_pclass = data.groupby("Pclass").mean()["Fare"]
c_e_survival(data, 'Pclass', "averagefare_pclass", averagefare_pclass)
0 13.675550
1 78.124061
2 13.675550
3 78.124061
4 13.675550
...
886 20.662183
887 78.124061
888 13.675550
889 78.124061
890 13.675550
Name: averagefare_pclass, Length: 888, dtype: float64
I did this for the following columns (the remaining two are sketched after the list):
- PCLASS
- CABIN
- EMBARKED
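The other two average-fare features were built the same way; a sketch, reusing the same helper:
# Average fare per cabin letter and per embarkation port, computed from the training data
averagefare_cabin = data.groupby("Cabin").mean()["Fare"]
c_e_survival(data, "Cabin", "averagefare_cabin", averagefare_cabin)
averagefare_embarked = data.groupby("Embarked").mean()["Fare"]
c_e_survival(data, "Embarked", "averagefare_embarked", averagefare_embarked)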
TOTAL PCLASS
I created a new column based on the relationship between Sex and Pclass, and used it to replace the pclass_survival_rate column
#Women & Pclass
c_e_survival(cc_w, "Pclass", "pclass_survival_rate", women_pclass_survival_rate)
#Men & Pclass
c_e_survival(cc_m, "Pclass", "pclass_survival_rate", men_pclass_survival_rate)
# Let's join the dataframes
cc_t = pd.merge(cc_w, cc_m, how = "outer")
cc_t.head()
data["pclass_survival_rate"] = cc_t["pclass_survival_rate"]
data.head()
Normalize Name & Fare
features1 = ['Name', "Fare"]
# Standardize (Z-Score)
#A dataframe with only the columns: Name and Fare
X_scaled = data.loc[:, features1]
X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)
data["Name"] = X_scaled["Name"]
data["Fare"] = X_scaled["Fare"]
FEATURE SELECTION
Copy the data to a new dataframe so you can make changes to it.
X = data.copy()
y = X.pop("Survived")
X.columns
Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin',
'Embarked', 'age_bracket', 'parch_survival_rate', 'cabin_survival_rate',
'pclass_survival_rate', 'sibsp_survival_rate', 'embarked_survival_rate',
'age_survival_rate', 'averagefare_pclass', 'averagefare_cabin',
'averagefare_embarked'],
dtype='object')
Let’s convert columns that have categorical data to numerical data so the model can work with it.
#Columns that hold categorical data
cat_columns = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Cabin',
'Embarked','age_bracket']
#Transform the categorical data to numerical data
X = pd.get_dummies(X, columns = cat_columns)
X.head()
Which columns are relevant to the model?
#Split the data to two datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
#instantiate your model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# view the feature scores
feature_scores = pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)
feature_scores
Sex_female 0.130195
Name 0.128378
Sex_male 0.104672
Age 0.102201
Fare 0.101379
pclass_survival_rate 0.059431
averagefare_pclass 0.036245
age_survival_rate 0.029338
averagefare_cabin 0.026577
cabin_survival_rate 0.025676
Pclass_3 0.022035
sibsp_survival_rate 0.021099
Cabin_H 0.017925
averagefare_embarked 0.013357
parch_survival_rate 0.012342
age_bracket_(16.0, 27.0] 0.011965
SibSp_0 0.011060
Pclass_2 0.010540
Pclass_1 0.010518
Embarked_S 0.009884
age_bracket_(27.0, 39.0] 0.009185
embarked_survival_rate 0.008763
Parch_0 0.008465
SibSp_1 0.008208
Cabin_E 0.006996
age_bracket_(49.0, 69.0] 0.006787
Embarked_C 0.006397
age_bracket_(39.0, 49.0] 0.006323
age_bracket_(-inf, 1.0] 0.006230
Parch_1 0.006099
Parch_2 0.005278
Embarked_Q 0.005011
age_bracket_(1.0, 5.0] 0.004864
Cabin_C 0.004064
age_bracket_(5.0, 16.0] 0.003774
SibSp_2 0.003685
Cabin_B 0.003156
SibSp_3 0.002012
Cabin_D 0.001761
SibSp_4 0.001520
Cabin_F 0.001512
Cabin_A 0.001358
age_bracket_(69.0, inf] 0.001057
Cabin_G 0.000748
Parch_5 0.000421
SibSp_5 0.000358
Parch_6 0.000356
SibSp_8 0.000269
Cabin_T 0.000225
Parch_3 0.000186
Parch_4 0.000115
dtype: float64
Let’s select the top features, stopping once the feature importances drop into the 0.00x range.
features = ['Name', 'Fare','Sex_male', 'Sex_female','Age','pclass_survival_rate','age_survival_rate','averagefare_pclass',
'averagefare_cabin','Pclass_3','Pclass_2','Pclass_1', 'cabin_survival_rate','sibsp_survival_rate',
'embarked_survival_rate', 'averagefare_embarked']
X_train = X_train[features]
X_train.head()
Why aren’t we picking some of the features?
To pick a one-hot encoded feature, I would have to pick all of its dummy columns. For example, Embarked_S scored relatively high, but I would also have to include Embarked_C and Embarked_Q, which scored low. The same can be said about Cabin_H, SibSp_1 and Parch_0.
Fit the model and check the metrics.
model = RandomForestClassifier()
model.fit(X_train, y_train)
from sklearn.metrics import confusion_matrix, classification_report
# Compute predictions over the prediction space: y_pred
X_test = X_test[features]
y_pred = model.predict(X_test)
# Print the confusion matrix and the classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[147 23]
[ 31 66]]
precision recall f1-score support
0 0.83 0.86 0.84 170
1 0.74 0.68 0.71 97
accuracy 0.80 267
macro avg 0.78 0.77 0.78 267
weighted avg 0.80 0.80 0.80 267
#What is the cross-validation score?
print(cross_val_score(model, X, y, cv=3, scoring="f1"))
[0.69902913 0.73873874 0.7706422 ]
TEST DATASET
You repeat the following for the test dataset (a sketch of the cleaning step follows the list):
- LOAD DATASET
- DATA CLEANING
- FEATURE ENGINEERING
- MAKE PREDICTIONS
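The key point, expanded in the NOTE at the end, is that the test set is cleaned and engineered with values computed from the training data. A minimal sketch of that idea (variable names assumed, not my notebook’s exact code):
# Load the test set
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')
# replace_mean and replace_mode compute their statistics from the TRAINING dataframe `data`,
# so calling them on test_data reuses the training values (see the NOTE below)
replace_mean(test_data, "Age")
get_cabin(test_data["Cabin"])
replace_mode(test_data, "Cabin", "H")
# Map the survival rates learned from the training data onto the test rows
c_e_survival(test_data, "Cabin", "cabin_survival_rate", cabin_survival_rate)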
MAKE PREDICTIONS
features = ['Name', 'Fare','Sex_male', 'Sex_female','Age','pclass_survival_rate','age_survival_rate','averagefare_pclass',
'averagefare_cabin','Pclass_3','Pclass_2','Pclass_1', 'cabin_survival_rate','sibsp_survival_rate',
'embarked_survival_rate', 'averagefare_embarked']
#Make predictions
predictions = model.predict(test_data[features])
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")
NOTE:
- It is important that when you are replacing missing values in your test_data, you USE values obtained from your training data and NOT from your test data. For example, I replaced the missing values in the “Age” column with the mean obtained from the training data.
- The survival rates of the columns in the test_data are obtained from the training data. You cannot calculate survival rates using the test_data because you are not given the “Survived” column.
- When calculating the average fare per “column”, you use the values obtained from the training data, NOT the test data.
- If you noticed, we created a few features even before the “FEATURE ENGINEERING” phase. Feature engineering can also happen during the “EXPLORATORY DATA ANALYSIS” phase.
- Please check the Kaggle notebook for more information. I excluded a few things from the notebook here to reduce the length of this post.
Olomo’s Titanic Data Analysis Feature Engineering | Kaggle
THE END.
There are still many things you can do. For example, you can bin the ticket numbers and analyze the survival rate of each ticket range.
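A minimal sketch of that idea, assuming the Ticket_num column is kept and converted to numbers:
# Convert ticket numbers to numeric values (tickets with no number become NaN)
data["Ticket_num"] = pd.to_numeric(data["Ticket_num"], errors="coerce")
# Split the ticket numbers into 5 equal-sized bins and check the survival rate of each range
data["ticket_bracket"] = pd.qcut(data["Ticket_num"], q=5)
print(data.groupby("ticket_bracket").mean()["Survived"])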
Let me know your thoughts in the comments.
Thank you!!!! Till next time.