For those learning Feature Engineering
FEATURE ENGINEERING DOESN’T HAVE TO BE COMPLICATED
In 2022, I reached out to my mentor and told him where I was currently in my Machine Learning Journey and asked him for help. He asked me a question and that question birthed this project:
What of feature engineering?
I said
I don’t know how to perform feature engineering.
3 months later, I sent this message:
So I took your advice and I’m doing a feature engineering course on Kaggle.
The only problem was that the project at the end of that Kaggle Course was too complex for me, and I felt out of place and stupid.
So, what did I do?
I revisited the Titanic Challenge and used it to learn a few feature engineering techniques that I had learnt in the course.
My first ever Machine Learning Project was the Kaggle Titanic Competition. I cried, I laughed and there were times when I wanted to scream BUT I DID NOT GIVE UP.
I pulled through and I learnt SO MUCH about Data Analysis and Feature Engineering from that project.
My first-ever submission outside of the tutorial did well but was very basic.
I’m going to take you on a step-by-step journey of my latest submission and the reason behind certain decisions.
My Notebook:
Olomo’s Titanic Data Analysis Feature Engineering | Kaggle
My entire notebook is divided into 8 phases:
- IMPORT NECESSARY MODULES
- LOAD DATASET
- DATA CLEANING
- EXPLORATORY DATA ANALYSIS
- OUTLIER DETECTION
- FEATURE ENGINEERING
- FEATURE SELECTION
- MAKE PREDICTIONS
IMPORT NECESSARY MODULES
Before you start doing anything, you need to first import all the modules you deem necessary for your ML notebook. As you progress through your notebook, you will add new modules and delete redundant ones.
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
!pip install sklearn_evaluation
from sklearn_evaluation import plot
LOAD DATASET
Here, I loaded the dataset into a data frame and looked at the first few rows
df = pd.read_csv('/kaggle/input/titanic/train.csv')
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
There are 12 columns.
Let’s copy the dataframe to a new one so we can make changes to it.
data = df.copy()
data.pop("PassengerId")
data.head()
DATA CLEANING
Looking at the dataset above, you observe a few things:
- There are missing values under the Cabin column
- Both the Cabin and Ticket columns have alphanumeric values
We’ll start by taking a look at the Cabin column
CABIN
Let’s take a look at its first 13 values
data["Cabin"][:13]
0 NaN
1 C85
2 NaN
3 C123
4 NaN
5 NaN
6 E46
7 NaN
8 NaN
9 NaN
10 G6
11 C103
12 NaN
Name: Cabin, dtype: object
How many unique values does this column have?
data["Cabin"].nunique()
147
There are too many unique values. Let’s see if we can just use the letters and disregard the numbers.
def get_cabin(v):
    # Keep only the deck letter (the first character) of each cabin value.
    # NaN entries are floats, so the string concatenation below fails and they are skipped.
    n = 0
    for i in v:
        try:
            i + "2"
        except TypeError:
            n += 1
        else:
            v[n] = i[:1]
            n += 1
    return v
get_cabin(data["Cabin"])
data["Cabin"][:13]
0 NaN
1 C
2 NaN
3 C
4 NaN
5 NaN
6 E
7 NaN
8 NaN
9 NaN
10 G
11 C
12 NaN
Name: Cabin, dtype: object
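As an aside, pandas’ string accessor could do the same thing in one line; a minimal sketch (not the code from my notebook), which also leaves the NaN entries untouched:
# Keep only the first character (the deck letter) of each cabin value; NaN stays NaN
data["Cabin"] = data["Cabin"].str[:1]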
Now that we have only the letters saved, let’s take a look at the number of unique values.
print("The number of Unique values in Cabin Column: \n", data["Cabin"].nunique())
print("The Unique values in Cabin Column: \n", data["Cabin"].unique())
The number of Unique values in Cabin Column:
8
The Unique values in Cabin Column:
[nan 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']
There are 8 different types of cabins [excluding NaN]. This seems reasonable.
Let’s take a look at the Ticket column.
TICKET
Let’s take a look at its first 13 values
data["Ticket"][:13]
0 A/5 21171
1 PC 17599
2 STON/O2. 3101282
3 113803
4 373450
5 330877
6 17463
7 349909
8 347742
9 237736
10 PP 9549
11 113783
12 A/5. 2151
Since there are some rows with only numbers, let’s see if we can just use the numbers and disregard the letters.
How many unique values does this column have?
data["Ticket"].nunique()
681
There are a lot of unique values
Let’s save just the numbers in a new column called “Ticket_num”
g = []
for i in data["Ticket"]:
    if " " in i:
        # Append the last element of the split. This ensures that even for tickets
        # with multiple spaces, it is the ticket number that gets saved.
        g.append(i.split(" ")[-1])
    else:
        g.append(i)
data["Ticket_num"] = g
data["Ticket_num"][:13]
0 21171
1 17599
2 3101282
3 113803
4 373450
5 330877
6 17463
7 349909
8 347742
9 237736
10 9549
11 113783
12 2151
Name: Ticket_num, dtype: object
How many unique values are in this column?
data["Ticket_num"].nunique()
679
The number of unique values dropped from 681 to 679, which isn’t much of a reduction. There are still too many unique values, so I will be dropping this column and the Ticket column.
data.drop(["Ticket_num","Ticket"], axis = 1)
Now that that is out of the way, let’s replace the missing values.
Replace missing values
How many missing values are in each column?
for i in data.columns:
    print(i, sum(data[i].isnull()))
data.shape
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
Ticket_num 0
(891, 12)
I replaced the missing values with:
- mean for Continuous Data,
- mode for Categorical Data or
- a new category if the number of missing values was large and the column contained categorical data. [This was done to observe patterns within the missing values.]
def replace_mean(df, column):
    # The mean of the column, rounded to 2 decimal places.
    # Note: the mean is always computed from the training dataframe `data`,
    # so the same value gets reused when cleaning the test set.
    mean = round((data[column].mean(axis=0, skipna=True)), 2)
    # Replace missing values with the mean
    df[column] = df[column].replace(np.nan, mean)

def replace_mode(df, column, value):
    if value:
        # Replace missing values with the supplied value
        df[column] = df[column].replace(np.nan, value)
    else:
        # The mode is also computed from the training dataframe `data`
        mode = (data[column].mode())[0]
        # Replace missing values with the mode
        df[column] = df[column].replace(np.nan, mode)
# Clean missing values in "Age"
replace_mean(data,"Age")
# Replace missing values in "Cabin" with "H" a new Cabin alphabet.
# I did this because of the huge amount of missing data in the column
replace_mode(data, "Cabin", "H")
# Replace the missing values with the column's mode, which in this case is "S"
# I did this because there are just 2 missing values
replace_mode(data, "Embarked", False)
Look at the unique values of the Cabin Column
data["Cabin"].unique()
array(['H', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)
You can see there is a new type of Cabin, “H”
How many missing values are now in each column?
for i in data.columns:
    print(i, sum(data[i].isnull()))
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
Ticket_num 0
EXPLORATORY DATA ANALYSIS
This step is the most important of all. I performed data analysis on each column:
- For individual columns with categorical data, I compared the number of people who survived in each category against the total number of people in that category [dead or alive], then calculated the survival rate of each category and saved it in a dataframe.
For example, I looked at the number of people that survived in each cabin, from Cabin A to Cabin T.
get_cabin(data["Cabin"])
# The number of people in each Cabin
print(data.groupby('Cabin').count()['Survived'].reset_index())
# The number of people that survived in each Cabin
print(data.groupby('Cabin').sum()['Survived'].reset_index())
The number of people in each Cabin
Cabin Survived
0 A 15
1 B 47
2 C 59
3 D 33
4 E 32
5 F 13
6 G 4
7 T 1
The number of people that survived in each Cabin
Cabin Survived
0 A 7
1 B 35
2 C 35
3 D 25
4 E 24
5 F 8
6 G 2
7 T 0
cabin_survival_rate = data.groupby("Cabin").mean()["Survived"]
cabin_survival_rate
#Survival of Cabin A =
#(Number of People who survived in Cabin A)/(Total number of people in Cabin A)
#The same for other Cabins
Cabin
A 0.466667
B 0.744681
C 0.593220
D 0.757576
E 0.750000
F 0.615385
G 0.500000
H 0.299854
T 0.000000
Name: Survived, dtype: float64
I did this for the following Columns:
- Cabin
- Pclass
- Parch
- SibSp
- Embarked
- For Individual Columns with Numerical Data:
AGE
I grouped the numbers into bins/brackets to ease my analysis, and THEN compared the number of people who survived in each bracket against the total number of people in that bracket [dead or alive], calculated the survival rate of each bracket, and saved it in a dataframe.
Let’s check which ages appear in the dataset and what the column’s datatype is.
# Check how many ages are in the data
print(data["Age"].unique(),'\n')
print(data["Age"].dtype)
array([22. , 38. , 26. , 35. , 29.7 , 54. , 2. , 27. , 14. ,
4. , 58. , 20. , 39. , 55. , 31. , 34. , 15. , 28. ,
8. , 19. , 40. , 66. , 42. , 21. , 18. , 3. , 7. ,
49. , 29. , 65. , 28.5 , 5. , 11. , 45. , 17. , 32. ,
16. , 25. , 0.83, 30. , 33. , 23. , 24. , 46. , 59. ,
71. , 37. , 47. , 14.5 , 70.5 , 32.5 , 12. , 9. , 36.5 ,
51. , 55.5 , 40.5 , 44. , 1. , 61. , 56. , 50. , 36. ,
45.5 , 20.5 , 62. , 41. , 52. , 63. , 23.5 , 0.92, 43. ,
60. , 10. , 64. , 13. , 48. , 0.75, 53. , 57. , 80. ,
70. , 24.5 , 6. , 0.67, 30.5 , 0.42, 34.5 , 74. ])
float64
What is the distribution of the ages?
sns.catplot(x= "Survived", y = "Age", data=data, height =10, aspect =2)
This plot is too scattered. Let’s group the ages by creating bins
# Let's group the ages together and then calculate the survival rate
bins = pd.IntervalIndex.from_tuples([(-np.inf, 1),(1, 5),(5, 16), (16, 27), (27, 39), (39, 49), (49, 69), (69, np.inf)])
bins
IntervalIndex([(-inf, 1.0], (1.0, 5.0], (5.0, 16.0], (16.0, 27.0], (27.0, 39.0], (39.0, 49.0], (49.0, 69.0], (69.0, inf]],
closed='right',
dtype='interval[float64]')
Let’s create a new feature that places each age in its appropriate bin (bracket).
Age Column:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
886 27.0
887 19.0
888 29.7
889 26.0
890 32.0
Name: Age, Length: 891, dtype: float64
data['age_bracket'] = pd.cut(data['Age'], bins)
data['age_bracket']
0 (16.0, 27.0]
1 (27.0, 39.0]
2 (16.0, 27.0]
3 (27.0, 39.0]
4 (27.0, 39.0]
...
886 (16.0, 27.0]
887 (16.0, 27.0]
888 (27.0, 39.0]
889 (16.0, 27.0]
890 (27.0, 39.0]
Name: age_bracket, Length: 891, dtype: category
Categories (8, interval[float64]): [(-inf, 1.0] < (1.0, 5.0] < (5.0, 16.0] < (16.0, 27.0] < (27.0, 39.0] < (39.0, 49.0] < (49.0, 69.0] < (69.0, inf]]
How many people survived in each age bracket?
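The two tables below come from the same groupby pattern used earlier for the Cabin column; a sketch of what that code looks like:
# The number of people in each age bracket
print("The number of people from each age bracket")
print(data.groupby('age_bracket').count()['Survived'].reset_index(), "\n")
# The number of people that survived in each age bracket
print("The number of people that survived in each age bracket")
print(data.groupby('age_bracket').sum()['Survived'].reset_index())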
The number of people from each age bracket
age_bracket Survived
0 (-inf, 1.0] 14
1 (1.0, 5.0] 30
2 (5.0, 16.0] 56
3 (16.0, 27.0] 237
4 (27.0, 39.0] 391
5 (39.0, 49.0] 89
6 (49.0, 69.0] 67
7 (69.0, inf] 7
The number of people that survived in each age bracket
age_bracket Survived
0 (-inf, 1.0] 12
1 (1.0, 5.0] 19
2 (5.0, 16.0] 24
3 (16.0, 27.0] 86
4 (27.0, 39.0] 140
5 (39.0, 49.0] 34
6 (49.0, 69.0] 26
7 (69.0, inf] 1
Now let’s calculate the survival rate of each category and save it in a dataframe.
age_survival_rate = data.groupby("age_bracket").mean()["Survived"]
age_survival_rate
age_bracket
(-inf, 1.0] 0.857143
(1.0, 5.0] 0.633333
(5.0, 16.0] 0.428571
(16.0, 27.0] 0.362869
(27.0, 39.0] 0.358056
(39.0, 49.0] 0.382022
(49.0, 69.0] 0.388060
(69.0, inf] 0.142857
Name: Survived, dtype: float64
FARE
Is there any relationship between a passenger’s Fare and their Survival?
sns.catplot(x="Survived", y="Fare", data=data, kind="boxen");
Based on the spread of the boxes, it looks like passengers who paid higher fares were more likely to survive than those who paid lower fares.
- For data analysis between a categorical column and a numeric column, seaborn.catplot works well:
sns.catplot(x="Pclass", y="Fare", data=data, kind="boxen");
As expected, the fare decreases as the ticket class goes from 1st to 3rd Class.
I did this analysis for
- PCLASS & FARE
- AGE_BRACKET & FARE
- CABIN & FARE
- EMBARKED & FARE
- Lastly, I performed data analysis on each gender
WOMEN
print("%d women were onboard"% data[data['Sex'] == 'female'].count()['Survived'])
print("%d women survived"% data[data['Sex'] == 'female'].sum()['Survived'])
314 women were onboard
233 women survived
WOMEN & PCLASS
Is there a relationship between the women who survived and their Pclass?
# The number of Women from each Pclass
cc_w = data[data['Sex'] == "female"]
print("The number of Women from each Pclass")
print(cc_w.groupby('Pclass').count()['Survived'].reset_index(),"\n")
# The number of Women that survived in each Pclass
print("The number of Women that survived in each Pclass")
print(cc_w.groupby('Pclass').sum()['Survived'].reset_index())
The number of Women from each Pclass
Pclass Survived
0 1 94
1 2 76
2 3 144
The number of Women that survived in each Pclass
Pclass Survived
0 1 91
1 2 70
2 3 72
It is very obvious that there is. Only 3 women in 1st Class died. JUST 3. While half of the women from 3rd Class died. That is a HUGE difference.
MEN
print("%d men were onboard"% data[data['Sex'] == 'male'].count()['Survived'])
print("%d men survived"% data[data['Sex'] == 'male'].sum()['Survived'])
577 men were onboard
109 men survived
MEN & PCLASS
Is there a relationship between the men who survived and their Pclass?
# The number of Men from each Pclass
cc_m = data[data['Sex'] == "male"]
print("The number of Men from each Pclass")
print(cc_m.groupby('Pclass').count()['Survived'].reset_index(),"\n")
# The number of Men that survived in each Pclass
print("The number of Men that survived in each Pclass")
print(cc_m.groupby('Pclass').sum()['Survived'].reset_index())
The number of Men from each Pclass
Pclass Survived
0 1 122
1 2 108
2 3 347
The number of Men that survived in each Pclass
Pclass Survived
0 1 45
1 2 17
2 3 47
Approximately 37% of men in 1st Class survived; that is WAAAY higher than the roughly 14% that survived in 3rd Class.
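These per-sex survival rates by Pclass come back later in the FEATURE ENGINEERING phase (the TOTAL PCLASS feature) as women_pclass_survival_rate and men_pclass_survival_rate. Their definitions aren’t shown in this post, but a sketch consistent with the cabin survival rate above would be:
# Survival rate of women in each Pclass (survivors / total women in that class)
women_pclass_survival_rate = cc_w.groupby("Pclass").mean()["Survived"]
# Survival rate of men in each Pclass (survivors / total men in that class)
men_pclass_survival_rate = cc_m.groupby("Pclass").mean()["Survived"]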
OUTLIER DETECTION
Here, we look at data points or rows that seem to be isolated from the majority of the data.
Looking at the plot showing the relationship between Cabin and Fare, you notice a few data points with extremely high fares; they are isolated from the rest of the group. Let’s explore them and then remove them.
data[data["Fare"] > 500]
Let’s remove them
index = data[data["Fare"] > 500].index
data.drop(index,inplace = True)
Let’s check if the rows have been dropped
data[data["Fare"] > 50
They have.
FEATURE ENGINEERING
Here we will create new features based on relationships observed during “Exploratory Data Analysis”.
- To make use of the “Name” Feature, I changed the text data to numerical data by saving the length of the names.
data["Name"] = [len(i) for i in data["Name"]]
data["Name"]
0 23
1 51
2 22
3 44
4 24
..
886 21
887 28
888 40
889 21
890 19
Name: Name, Length: 891, dtype: int64
I wrote a general function to help me attach a column’s survival rates to each row and save the result in a new column. I called it:
The Survival Rate Function
def c_e_survival(df, c_e, c_e_survival_rate, e_survival_rate):
    # df = dataframe
    # c_e = column name
    # c_e_survival_rate = name of the new column
    # e_survival_rate = Series holding the survival rate (or average) of each category
    # Create a new column in the "df" dataframe:
    # for each person, store the value that belongs to their category
    df[c_e_survival_rate] = [e_survival_rate[i] for i in df[c_e]]
    return df[c_e_survival_rate]
CABIN
c_e_survival(data, "Cabin", "Cabin_survival_rate", Cabin_survival_rate)
0 0.299854
1 0.593220
2 0.299854
3 0.593220
4 0.299854
...
886 0.299854
887 0.744681
888 0.299854
889 0.593220
890 0.299854
Name: cabin_survival_rate, Length: 891, dtype: float64
I did this for the following columns (a sketch for one of them follows the list):
- PARCH
- PCLASS
- SIBSP
- EMBARKED
- AGE
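For instance, for Embarked the steps mirror the Cabin example; a sketch, assuming the notebook follows the same pattern:
# Survival rate of each embarkation port, computed from the training data
embarked_survival_rate = data.groupby("Embarked").mean()["Survived"]
# Attach each passenger's rate in a new column
c_e_survival(data, "Embarked", "embarked_survival_rate", embarked_survival_rate)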
- For feature engineering between a categorical column and a numeric column, calculating the per-category average works well.
AVERAGE FARE PER PCLASS
#Calculate the average fare of each Pclass
averagefare_pclass = data.groupby("Pclass").mean()["Fare"]
c_e_survival(data, 'Pclass', "averagefare_pclass", averagefare_pclass)
0 13.675550
1 78.124061
2 13.675550
3 78.124061
4 13.675550
...
886 20.662183
887 78.124061
888 13.675550
889 78.124061
890 13.675550
Name: averagefare_pclass, Length: 888, dtype: float64
I did this for the following columns (the remaining two are sketched after the list):
- PCLASS
- CABIN
- EMBARKED
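The other two average-fare features were built the same way; a sketch, reusing the same helper:
# Average fare per cabin letter and per embarkation port, computed from the training data
averagefare_cabin = data.groupby("Cabin").mean()["Fare"]
c_e_survival(data, "Cabin", "averagefare_cabin", averagefare_cabin)
averagefare_embarked = data.groupby("Embarked").mean()["Fare"]
c_e_survival(data, "Embarked", "averagefare_embarked", averagefare_embarked)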
TOTAL PCLASS
I created a new column based on the relationship between Sex and Pclass, and used it to replace the pclass_survival_rate column
#Women & Pclass
c_e_survival(cc_w, "Pclass", "pclass_survival_rate", women_pclass_survival_rate)
#Men & Pclass
c_e_survival(cc_m, "Pclass", "pclass_survival_rate", men_pclass_survival_rate)
# Let's join the dataframes
cc_t = pd.merge(cc_w, cc_m, how = "outer")
cc_t.head()
data["pclass_survival_rate"] = cc_t["pclass_survival_rate"]
data.head()
Normalize Name & Fare
features1 = ['Name', "Fare"]
# Standardize (Z-Score)
#A dataframe with only the columns: Name and Fare
X_scaled = data.loc[:, features1]
X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)
data["Name"] = X_scaled["Name"]
data["Fare"] = X_scaled["Fare"]
FEATURE SELECTION
Copy the data to a new dataframe so you can make changes to it.
X = data.copy()
y = X.pop("Survived")
X.columns
Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin',
'Embarked', 'age_bracket', 'parch_survival_rate', 'cabin_survival_rate',
'pclass_survival_rate', 'sibsp_survival_rate', 'embarked_survival_rate',
'age_survival_rate', 'averagefare_pclass', 'averagefare_cabin',
'averagefare_embarked'],
dtype='object')
Let’s convert columns that have categorical data to numerical data so the model can work with it.
#Columns that hold categorical data
cat_columns = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Cabin',
'Embarked','age_bracket']
#Transform the categorical data to numerical data
X = pd.get_dummies(X, columns = cat_columns)
X.head()
Which columns are relevant to the model?
#Split the data to two datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
#instantiate your model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# view the feature scores
feature_scores = pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)
feature_scores
Sex_female 0.130195
Name 0.128378
Sex_male 0.104672
Age 0.102201
Fare 0.101379
pclass_survival_rate 0.059431
averagefare_pclass 0.036245
age_survival_rate 0.029338
averagefare_cabin 0.026577
cabin_survival_rate 0.025676
Pclass_3 0.022035
sibsp_survival_rate 0.021099
Cabin_H 0.017925
averagefare_embarked 0.013357
parch_survival_rate 0.012342
age_bracket_(16.0, 27.0] 0.011965
SibSp_0 0.011060
Pclass_2 0.010540
Pclass_1 0.010518
Embarked_S 0.009884
age_bracket_(27.0, 39.0] 0.009185
embarked_survival_rate 0.008763
Parch_0 0.008465
SibSp_1 0.008208
Cabin_E 0.006996
age_bracket_(49.0, 69.0] 0.006787
Embarked_C 0.006397
age_bracket_(39.0, 49.0] 0.006323
age_bracket_(-inf, 1.0] 0.006230
Parch_1 0.006099
Parch_2 0.005278
Embarked_Q 0.005011
age_bracket_(1.0, 5.0] 0.004864
Cabin_C 0.004064
age_bracket_(5.0, 16.0] 0.003774
SibSp_2 0.003685
Cabin_B 0.003156
SibSp_3 0.002012
Cabin_D 0.001761
SibSp_4 0.001520
Cabin_F 0.001512
Cabin_A 0.001358
age_bracket_(69.0, inf] 0.001057
Cabin_G 0.000748
Parch_5 0.000421
SibSp_5 0.000358
Parch_6 0.000356
SibSp_8 0.000269
Cabin_T 0.000225
Parch_3 0.000186
Parch_4 0.000115
dtype: float64
Let’s select the top features, stopping once the feature importances drop into the 0.00x range.
features = ['Name', 'Fare','Sex_male', 'Sex_female','Age','pclass_survival_rate','age_survival_rate','averagefare_pclass',
'averagefare_cabin','Pclass_3','Pclass_2','Pclass_1', 'cabin_survival_rate','sibsp_survival_rate',
'embarked_survival_rate', 'averagefare_embarked']
X_train = X_train[features]
X_train.head()
Why aren’t we picking some of the features?
To pick a one-hot encoded feature, I would have to pick all of its dummy columns. For example, Embarked_S scored relatively high, but I would also have to include Embarked_C and Embarked_Q, which scored low. The same can be said about Cabin_H, SibSp_1 and Parch_0.
Fit the model and check the metrics.
model = RandomForestClassifier()
model.fit(X_train, y_train)
from sklearn.metrics import confusion_matrix, classification_report
# Compute predictions over the prediction space: y_pred
X_test = X_test[features]
y_pred = model.predict(X_test)
# Print the confusion matrix and the classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[147 23]
[ 31 66]]
precision recall f1-score support
0 0.83 0.86 0.84 170
1 0.74 0.68 0.71 97
accuracy 0.80 267
macro avg 0.78 0.77 0.78 267
weighted avg 0.80 0.80 0.80 267
#What is the cross-validation score?
print(cross_val_score(model, X, y, cv=3, scoring="f1"))
[0.69902913 0.73873874 0.7706422 ]
TEST DATASET
You repeat the following for the test dataset (a sketch of the cleaning step follows the list):
- LOAD DATASET
- DATA CLEANING
- FEATURE ENGINEERING
- MAKE PREDICTIONS
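The key point, expanded in the NOTE at the end, is that the test set is cleaned and engineered with values computed from the training data. A minimal sketch of that idea (variable names assumed, not my notebook’s exact code):
# Load the test set
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')
# replace_mean and replace_mode compute their statistics from the TRAINING dataframe `data`,
# so calling them on test_data reuses the training values (see the NOTE below)
replace_mean(test_data, "Age")
get_cabin(test_data["Cabin"])
replace_mode(test_data, "Cabin", "H")
# Map the survival rates learned from the training data onto the test rows
c_e_survival(test_data, "Cabin", "cabin_survival_rate", cabin_survival_rate)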
MAKE PREDICTIONS
features = ['Name', 'Fare','Sex_male', 'Sex_female','Age','pclass_survival_rate','age_survival_rate','averagefare_pclass',
'averagefare_cabin','Pclass_3','Pclass_2','Pclass_1', 'cabin_survival_rate','sibsp_survival_rate',
'embarked_survival_rate', 'averagefare_embarked']
#Make predictions
predictions = model.predict(test_data[features])
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")
NOTE:
- It is important that when you are replacing missing values in your test_data, you USE values obtained from your training data and NOT from your test data. For example, I replaced the missing values in the “Age” column with the mean obtained from the training data.
- The survival rates of the columns in the test_data are obtained from the training data. You cannot calculate survival rates using the test_data because you are not given the “Survived” column.
- When calculating the average fare per “column”, you use the values obtained from the training data, NOT the test data.
- If you noticed, we created a few features even before the “FEATURE ENGINEERING” phase. Feature engineering can also happen during the “EXPLORATORY DATA ANALYSIS” phase.
- Please check the Kaggle notebook for more information. I excluded a few things from the notebook here to reduce the length of this post.
Olomo’s Titanic Data Analysis Feature Engineering | Kaggle
THE END.
There are still many things you can do. For example, you can bin the ticket numbers and analyze the survival rate of each ticket range.
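A minimal sketch of that idea, assuming the Ticket_num column is kept and converted to numbers:
# Convert ticket numbers to numeric values (tickets with no number become NaN)
data["Ticket_num"] = pd.to_numeric(data["Ticket_num"], errors="coerce")
# Split the ticket numbers into 5 equal-sized bins and check the survival rate of each range
data["ticket_bracket"] = pd.qcut(data["Ticket_num"], q=5)
print(data.groupby("ticket_bracket").mean()["Survived"])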
Let me know your thoughts in the comments.
Thank you!!!! Till next time.