Description of Problem¶
Suppose you have a simple dataset with 10 features, where each feature corresponds to one of 10 sensors. Each sensor produces a floating-point signal value between 0 and 1, and each sample is labelled either -1 or 1. The sensors produce differently distributed signal values, and the dataset contains 400 samples (rows).
You can download the dataset at the link below
The goal of this problem is to rank the sensors by their predictive power and to plot the ranked sensors in descending order. There are many ways to score feature importance; here we use information gain (IG) to score the importance of each sensor and rank them.
Load data¶
First load the dataset
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import scipy.stats as st
from math import log
from collections import OrderedDict
from collections import Counter
from IPython.display import Image
# set tick label size for all figures in this notebook
label_size = 13
mpl.rcParams['xtick.labelsize'] = label_size
mpl.rcParams['ytick.labelsize'] = label_size
Start by creating a SensorReadings class, whose __init__ method loads the task data and stores all sensor readings in a dataframe.
class SensorReadings:
    def __init__(self, filename):
        # read task data
        self.df = pd.read_csv(filename)
        # sensor name columns (skip 'sample index' and 'class_label')
        self.sensor_index = self.df.columns[2:]
filename = os.path.join(os.getcwd(),'sensors_dataset.csv')
sensorData = SensorReadings(filename=filename)
display(sensorData.df)
| | sample index | class_label | sensor0 | sensor1 | sensor2 | sensor3 | sensor4 | sensor5 | sensor6 | sensor7 | sensor8 | sensor9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | sample0 | 1.0 | 0.834251 | 0.726081 | 0.535904 | 0.214896 | 0.873788 | 0.767605 | 0.111308 | 0.557526 | 0.599650 | 0.665569 |
1 | sample1 | 1.0 | 0.804059 | 0.253135 | 0.869867 | 0.334285 | 0.604075 | 0.494045 | 0.833575 | 0.194190 | 0.014966 | 0.802918 |
2 | sample2 | 1.0 | 0.694404 | 0.595777 | 0.581294 | 0.799003 | 0.762857 | 0.651393 | 0.075905 | 0.007186 | 0.659633 | 0.831009 |
3 | sample3 | 1.0 | 0.783690 | 0.038780 | 0.285043 | 0.627305 | 0.800620 | 0.486340 | 0.827723 | 0.339807 | 0.731343 | 0.892359 |
4 | sample4 | 1.0 | 0.788835 | 0.174433 | 0.348770 | 0.938244 | 0.692065 | 0.377620 | 0.183760 | 0.616805 | 0.492899 | 0.930969 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | sample395 | -1.0 | 0.433150 | 0.816109 | 0.452945 | 0.065469 | 0.237093 | 0.719321 | 0.577969 | 0.085598 | 0.357115 | 0.070060 |
396 | sample396 | -1.0 | 0.339346 | 0.914610 | 0.097827 | 0.077522 | 0.484140 | 0.690568 | 0.420054 | 0.482845 | 0.395148 | 0.438641 |
397 | sample397 | -1.0 | 0.320118 | 0.444951 | 0.401896 | 0.970993 | 0.960264 | 0.138345 | 0.354927 | 0.230749 | 0.204612 | 0.558889 |
398 | sample398 | -1.0 | 0.059132 | 0.337426 | 0.772847 | 0.099038 | 0.966042 | 0.975086 | 0.532891 | 0.035839 | 0.258723 | 0.709958 |
399 | sample399 | -1.0 | 0.379778 | 0.460256 | 0.229257 | 0.768975 | 0.321882 | 0.118572 | 0.448964 | 0.546324 | 0.363127 | 0.176632 |
400 rows × 12 columns
The list of sensors is shown below:
sensorData.sensor_index
Index(['sensor0', 'sensor1', 'sensor2', 'sensor3', 'sensor4', 'sensor5', 'sensor6', 'sensor7', 'sensor8', 'sensor9'], dtype='object')
Visualizing sensor signals¶
To visualize the individual sensor signals, a helper function is written. This function is bound to the SensorReadings class.
def plot_sensor_readings(self):
    fig = plt.figure(figsize=(10, 9), constrained_layout=False)
    n_row, n_col = len(self.sensor_index)//2, 2
    axs = fig.subplots(n_row, n_col)
    plt.suptitle("individual sensors readings", fontsize=20)
    color = 'tab:blue'
    color_dualAxs = 'tab:red'
    idx_pair = 0
    for r in range(n_row):
        for c in range(n_col):
            axs[r, c].set_title(self.sensor_index[idx_pair],
                                y=1.3, pad=-14, c='k', fontsize=15)
            axs[r, c].plot(self.df[self.sensor_index[idx_pair]],
                           'o', markersize=4, color=color)
            axs[r, c].set_ylabel('sensor readings', color=color, fontsize=10)
            axs[r, c].tick_params(axis='y', labelcolor=color)
            axs[r, c].set_yticks([0, 0.5, 1])
            axs_dual = axs[r, c].twinx()
            axs_dual.set_ylabel('class label', color=color_dualAxs, fontsize=10)
            axs_dual.plot(self.df.class_label, color=color_dualAxs, linewidth=3)
            axs_dual.tick_params(axis='y', labelcolor=color_dualAxs)
            axs_dual.set_yticks([-1, 1])
            if r < n_row-1:
                axs[r, c].set_xticklabels([])
            idx_pair += 1
    axs[n_row-1, 0].set_xlabel('sample index', fontsize=12)
    axs[n_row-1, 1].set_xlabel('sample index', fontsize=12)
    plt.tight_layout()
Bind this helper function to the SensorReadings class:
SensorReadings.plot_sensor_readings = plot_sensor_readings
sensorData.plot_sensor_readings()
As seen from all of the sensor signals,
- the first 200 samples belong to class label 1
- the second 200 samples belong to class label -1
To be a good classifier, a sensor should generate reading values whose distributions are reasonably well separated between the two classes.
- Signals from sensors 0, 2, 4, 6, and 8 appear to separate the classes fairly well
- Other sensors, such as 1, 7, and 9, do not appear to separate them well
Split dataset into features and target¶
Separate class labels from sensor readings:
- y for target labels
- X for input features
y = sensorData.df['class_label'].values
X = sensorData.df[sensorData.sensor_index].values
print(f'Shape of sensor readings {X.shape}')
print(f'Shape of class labels {y.shape}')
Shape of sensor readings (400, 10)
Shape of class labels (400,)
To understand the given data further, we can apply principal component analysis (PCA).
PCA is basically a dimensionality-reduction method that projects the data onto the plane of the two major axes that best describe its distribution.
We can then check how well the projected data is grouped into the two classes (1 and -1).
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)
df_PC = pd.DataFrame(data=principalComponents, columns = ['PC1', 'PC2'])
df_PC['true_label'] = sensorData.df['class_label']
display(df_PC)
| | PC1 | PC2 | true_label |
|---|---|---|---|
0 | -0.307421 | -0.210411 | 1.0 |
1 | 0.148512 | 0.210618 | 1.0 |
2 | -0.438685 | -0.444488 | 1.0 |
3 | -0.492798 | 0.568961 | 1.0 |
4 | -0.473759 | 0.061023 | 1.0 |
... | ... | ... | ... |
395 | 0.482446 | -0.194064 | -1.0 |
396 | 0.357519 | 0.018492 | -1.0 |
397 | -0.100119 | -0.143667 | -1.0 |
398 | 0.139063 | -0.068952 | -1.0 |
399 | 0.185756 | -0.013335 | -1.0 |
400 rows × 3 columns
The df_PC dataframe contains the two principal components along with the corresponding true labels for the 400 samples.
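To see how much of the total variance the two principal components actually capture, a quick optional check of the explained variance ratio can be added (the exact numbers are not shown in the original output):
# fraction of the total variance captured by PC1 and PC2
print(pca.explained_variance_ratio_)
print("total explained variance:", pca.explained_variance_ratio_.sum())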
k-means clustering¶
We can apply the k-means clustering method to group the projected data into two classes.
from sklearn.cluster import KMeans
# n_clusters is 2: 1 and -1
kmeans = KMeans(n_clusters=2, random_state=0).fit(df_PC.iloc[:,:2])
# replace all label 0s to -1
kmeans.labels_[kmeans.labels_==0] = -1
# add kmeans-labels to df_PC
df_PC['labels_by_kmeans'] = kmeans.labels_
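One caveat: k-means assigns arbitrary cluster ids (0 and 1), so mapping cluster 0 to label -1 matches the true labels here only because of the fixed random_state. A more robust sketch (an optional addition, not part of the original code) flips the mapping whenever the agreement with the true labels falls below 50%:
# flip the cluster-to-class mapping if it disagrees with the true labels more often than not
agreement = np.mean(df_PC['labels_by_kmeans'] == df_PC['true_label'])
if agreement < 0.5:
    df_PC['labels_by_kmeans'] = -df_PC['labels_by_kmeans']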
Visualizing clustering results¶
fig = plt.figure(figsize=(12, 5), constrained_layout=False)
axs = fig.subplots(1,2)
class1_true = df_PC[df_PC['true_label'] == 1]
class2_true = df_PC[df_PC['true_label'] == -1]
marker_size = 12
axs[0].scatter(class1_true['PC1'], class1_true['PC2'], c='tab:red', s=marker_size, label='label: 1')
axs[0].scatter(class2_true['PC1'], class2_true['PC2'], c='tab:blue', s=marker_size, label='label: -1')
axs[0].set_title("true labels", fontsize=15)
_= axs[0].set_xlabel('PC1', fontsize=13)
_= axs[0].set_ylabel('PC2', fontsize=13)
_= axs[0].legend(fontsize=14)
axs[1].set_title("labels predicted by k-means algorithm", fontsize=15)
class1_kmeans = df_PC[df_PC['labels_by_kmeans'] == 1]
class2_kmeans = df_PC[df_PC['labels_by_kmeans'] == -1]
marker_size = 12
axs[1].scatter(class1_kmeans['PC1'], class1_kmeans['PC2'], c='tab:red', s=marker_size, label='label: 1')
axs[1].scatter(class2_kmeans['PC1'], class2_kmeans['PC2'], c='tab:blue', s=marker_size, label='label: -1')
axs[1].scatter(kmeans.cluster_centers_[0][0],kmeans.cluster_centers_[0][1], c='k',
s=100, marker='^', label='centroids')
axs[1].scatter(kmeans.cluster_centers_[1][0],kmeans.cluster_centers_[1][1], c='k',
marker='^',s=100)
_= axs[1].set_xlabel('PC1', fontsize=13)
_= axs[1].set_ylabel('PC2', fontsize=13)
_= axs[1].legend(fontsize=14,bbox_to_anchor=(1., 1), loc='upper left',)
The accuracy of the label prediction by the k-means method is fairly high:
acc = np.count_nonzero(df_PC['true_label'] == df_PC['labels_by_kmeans'])/len(sensorData.df)
print("accuracy is {}".format(acc))
accuracy is 0.9125
As the PCA and k-means clustering show, the sensor readings can be grouped into two classes quite successfully by projecting them onto the two major axes.
Now we can move on to ranking the sensors.
Ranking sensors¶
Helper functions¶
First prepare helper functions.
# This function plots obtained ranking scores in descending order.
def plotFeatureScores(scoring_method, y_label):
    scores = rank[scoring_method]
    keys, values = scores.keys(), scores.values()
    plt.figure(figsize=(13, 4))
    plt.title(scoring_method, y=0.9, pad=-14, fontsize=15)
    plt.bar(keys, values, color='green')
    plt.ylabel(y_label, fontsize=15)
    plt.xticks(rotation=30)
    plt.grid(color='grey', linestyle='-', linewidth=0.5, axis='y')
    plt.show()

# sort sensors by importance score in descending order
def sortScores(scores):
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
Create a dictionary variable to store ranking scores from all considered methods in this task.
rank = {}
Store the sensor names in a feature_names list:
feature_names = sensorData.df.columns[2:]
for f in feature_names:
    print(f)
sensor0 sensor1 sensor2 sensor3 sensor4 sensor5 sensor6 sensor7 sensor8 sensor9
Information Gain¶
We calculate information gain (IG) to score the importance of each sensor. IG evaluates how much information is obtained when the class labels are classified by an individual feature rather than by all features. IG is therefore a function of the feature, and the most important feature has the highest IG score.
IG is defined as the difference in Shannon entropy before and after splitting by a feature:
$$ IG(\mathrm{feature}) = H_{\mathrm{before}} - H_{\mathrm{split \, by \, feature}} $$
where $H$ is the Shannon entropy given by
$$ H = -\sum_{i}^{C} p_{i}\log_{\mathrm{base}} p_{i} $$
where $C$ denotes the number of classes and $p_{i}$ is the relative frequency of class $i$. The base of the logarithm is set to the number of classes; for a binary classification problem such as this one, the base is 2.
A Warm-up Example
Before getting into the sensor-ranking problem, and for a better understanding of IG scoring, let's consider the very simple binary classification problem in the figure below.
figure_path = os.path.join(os.getcwd(),'figures')
display(Image(filename=figure_path+'/binaryClassification.png', width=600))
Suppose we have data consisting of five o's and five x's. Their distribution is fixed, and Figure A shows the data before classification.
- case 0: before classification (Figure A)
Now we consider three classification cases and calculate the IG for each:
- case 1: after classified by $x_1$ (Figure B)
- case 2: after classified by $x_2$ (Figure C)
- case 3: after classified by $x_3$ (Figure D)
To obtain the IG for each classification, $H_{before}$ (for case 0) and $H_{after}$ (for cases 1, 2, and 3) need to be calculated separately.
Figure A: entropy before classification: $H_{before}$
$p_{\mathrm{o}}$ = $p_{\mathrm{x}}$ = 1/2
Thus, $$ \begin{align} H_{before} &= -p(\mathrm{o})\log_2(p(\mathrm{o})) -p(\mathrm{x})\log_2(p(\mathrm{x})) \nonumber\\ &= -1/2\log_2(1/2) -1/2\log_2(1/2) \nonumber\\ &= 1\nonumber \end{align}$$
This is the maximum entropy because the impurity is greatest.
Figure B: When classified by $x_1$
In the left partition (four o's, no x's),
$H_{left}=0$
In the right partition (one o, five x's),
$p(\mathrm{x}) = 5/6, \; p(\mathrm{o}) = 1/6$
$$ H_{right} = -\frac{5}{6}\log_2(5/6) - \frac{1}{6}\log_2(1/6) = 0.65 $$
$H_{after}$ is the weighted sum over the two partitions:
$$ \begin{align} H_{after} &= (4/10)H_{left} + (6/10)H_{right} \nonumber\\ &= (4/10)\times 0 + (6/10)\times 0.65 \nonumber\\ &= 0.39 \nonumber \end{align}$$
Thus,
$$IG(x_1) = H_{before} - H_{after} = 0.61$$
Figure C: When classified by $x_2$
Both partitions are pure, so $H_{left}=H_{right}=0$,
thus $H_{after}=0$ and
$$IG(x_2) = 1,$$
meaning that the information gain is maximized by $x_2$.
Figure D: When classified by $x_3$
$H_{left}=1$ and $H_{right}=0$,
thus $H_{after}=1$ and
$$IG(x_3) = 0,$$
meaning that there is no information gain by $x_3$.
The results are
$$ \begin{align} IG(x_1) &= 0.61 \nonumber\\ IG(x_2) &= 1 \nonumber\\ IG(x_3) &= 0 \nonumber\\ \end{align} $$
As expected, $IG(x_2)$ is the highest because $x_2$ separates the o's and x's perfectly. So $x_2$ is the most important feature.
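As a quick sanity check, the warm-up numbers can be reproduced with scipy.stats.entropy (already imported as st), using the label counts read off the figures:
# H_before for five o's and five x's
H_before = st.entropy([5, 5], base=2)                                    # 1.0
# x1: left partition (4 o, 0 x), right partition (1 o, 5 x)
H_after_x1 = (4/10)*st.entropy([4, 0], base=2) + (6/10)*st.entropy([1, 5], base=2)
# x2: both partitions are pure, so H_after is zero
H_after_x2 = 0.0
# x3: H_after equals 1, as derived above
H_after_x3 = 1.0
print(round(H_before - H_after_x1, 2))   # 0.61
print(round(H_before - H_after_x2, 2))   # 1.0
print(round(H_before - H_after_x3, 2))   # 0.0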
Ranking Sensors Problem¶
Now let's solve the sensor-ranking problem. As in the example above, the plan is to calculate the IG for each sensor. This evaluates the relative importance of the individual sensors, by which the sensors are then ranked in terms of predictive power.
One thing to note is that the sensor signals are continuous values, so they need to be discretized. This is done by binning the values and assigning bin labels. The target labels, on the other hand, are already categorical and do not need to be discretized.
A minimal sketch of the binning step is shown next, and the detailed implementation follows after that.
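As a minimal illustration of the binning step (using pandas.cut for brevity; the InformationGain class below implements its own equivalent loop), the readings of one sensor could be discretized like this:
# discretize sensor0 into 5 equal-width bins between 0 and 1
# (note: pd.cut uses right-closed intervals, while the class below uses left-closed ones,
#  so counts at bin edges can differ slightly)
edges = np.linspace(0, 1, 6)
binned = pd.cut(sensorData.df['sensor0'], bins=edges, labels=[f'bin{i}' for i in range(5)])
print(binned.value_counts().sort_index())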
Library for Shannon Entropy
To calculate the Shannon entropy, scipy.stats.entropy (imported above as st) is used.
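Note that scipy.stats.entropy accepts raw counts and normalizes them to probabilities internally, so label counts can be passed directly:
# counts [200, 200] are normalized to probabilities [0.5, 0.5] internally
print(st.entropy([200, 200], base=2))   # 1.0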
Coding
An InformationGain class is written as a derived class of SensorReadings. This InformationGain class contains the entire IG calculation.
The class takes two inputs:
- the name of the data file (passed on to SensorReadings)
- the number of bins for discretizing the sensor-reading values
Let's go through the following code.
class InformationGain(SensorReadings):
    def __init__(self, _n_bins, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # total number of samples (400 for the given data)
        self.n_sample = len(self.df)
        # member list containing the sensor names sensor0 to sensor9
        self.sensor_list = self.df.columns[2:]
        # number of bin edges (n_bins edges define n_bins-1 bins)
        self.n_bins = _n_bins
        # evenly spaced bin edges between 0 and 1
        self.bins = np.linspace(0, 1, self.n_bins)
        # dictionary to store the IG scores of all sensors
        self.IG = {}

    def binned_label_collector(self, sensor_name):
        # collect the class labels of the readings that fall into each bin
        self.binCollector = {'bin'+str(i): [] for i in range(self.n_bins-1)}
        for val, label in zip(self.df[sensor_name], self.df['class_label']):
            for i in range(self.n_bins-1):
                if val >= self.bins[i] and val < self.bins[i+1]:
                    self.binCollector['bin'+str(i)].append(label)

    def count_labels(self, d_bin):
        # count the occurrences of each label within one bin
        self.count_label = {'1': 0, '-1': 0}
        for l in d_bin:
            if l == 1:
                self.count_label['1'] += 1
            else:
                self.count_label['-1'] += 1

    def cal_H_before(self):
        # get the number of samples in each class
        labelCount = list(Counter(self.df['class_label']).values())
        # this should be [200, 200]
        # set the base of the logarithm to the number of classes
        # (-1 and 1 for this task, so base=2)
        self.base = len(labelCount)
        # calculate H_before
        self.H_before = st.entropy(labelCount, base=self.base)

    def cal_IG(self):
        # obtain H_before first
        self.cal_H_before()
        # obtain H_after and IG by running over all sensors
        for sensor_name in self.sensor_list:
            self.binned_label_collector(sensor_name)
            # for the current sensor, H_after is obtained by running over all bins
            H_after = 0
            for key, values in self.binCollector.items():
                self.count_labels(values)
                weighted_samples = (self.count_label['1'] +
                                    self.count_label['-1'])/self.n_sample
                H_binned = st.entropy([self.count_label['1'], self.count_label['-1']], base=self.base)
                # weighted sum of each bin's entropy gives H_after
                H_after += H_binned*weighted_samples
            # store the IG score of the current sensor
            self.IG[sensor_name] = self.H_before - H_after
        # print the IG scores of all sensors
        for sensor_name, IG_score in self.IG.items():
            print(sensor_name+": ", IG_score)
# create instances with different numbers of bins: 6 edges give 5 bins, 11 edges give 10 bins
IGForSensors = {}
IGForSensors['bin5'] = InformationGain(_n_bins=6, filename=filename)
IGForSensors['bin10'] = InformationGain(_n_bins=11, filename=filename)
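Note that _n_bins counts bin edges, so passing 6 and 11 gives 5 and 10 bins respectively (hence the dictionary keys 'bin5' and 'bin10'); a quick check of the edges:
# 6 evenly spaced edges between 0 and 1 define 5 bins
print(IGForSensors['bin5'].bins)   # [0.  0.2 0.4 0.6 0.8 1. ]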
Check how it works (for sensor0)¶
Let's take a closer look at how this method works by tracing the IG calculation for one sensor. The following code is written just to illustrate the procedure.
# for sensor0, for example
# use the instance with 5 bins ('bin5')
IGForSensors['bin5'].binned_label_collector('sensor0')
Go through all bins, check the collected populations of labels 1 and -1, and calculate $H_{bin}$ for each bin; $H_{after}$ is then the weighted sum of $H_{bin}$ over all bins.
data = []
total_sample = 0
H_after = 0
for key, ls in IGForSensors['bin5'].binCollector.items():
    counts = list(Counter(ls).values())
    weight = sum(counts)/IGForSensors['bin5'].n_sample
    # entropy on each bin
    H_bin = st.entropy(counts, base=2)
    print(key, dict(Counter(ls)))
    total_sample += sum(counts)
    data.append([key, dict(Counter(ls)), H_bin, weight])
    H_after += H_bin*weight
print("")
# check the total number of samples
if total_sample == IGForSensors['bin5'].n_sample:
    print("All samples {} are collected!".format(total_sample))
bin0 {1.0: 9, -1.0: 44}
bin1 {1.0: 17, -1.0: 69}
bin2 {1.0: 38, -1.0: 54}
bin3 {1.0: 76, -1.0: 18}
bin4 {1.0: 60, -1.0: 15}

All samples 400 are collected!
df_sample = pd.DataFrame(data=data, columns=['bin_index', 'label_counts', 'H_bin', 'weight'])
df_sample = df_sample.set_index('bin_index')
display(df_sample)
print("H_after = {}".format(H_after))
print("thus, IG({}) = {}".format('sensor0',1- H_after))
| bin_index | label_counts | H_bin | weight |
|---|---|---|---|
bin0 | {1.0: 9, -1.0: 44} | 0.657273 | 0.1325 |
bin1 | {1.0: 17, -1.0: 69} | 0.717252 | 0.2150 |
bin2 | {1.0: 38, -1.0: 54} | 0.978071 | 0.2300 |
bin3 | {1.0: 76, -1.0: 18} | 0.704577 | 0.2350 |
bin4 | {1.0: 60, -1.0: 15} | 0.721928 | 0.1875 |
H_after = 0.7671913210009906
thus, IG(sensor0) = 0.23280867899900937
The IG score for sensor0 has been obtained.
IG scores for all sensors¶
Now we can get the IG scores for all sensors simply by calling the cal_IG() method. The results are sorted in descending order and then plotted.
IGForSensors['bin5'].cal_IG()
# store the results to rank after sorting scores in descending order
rank['InformationGain_bin5'] = sortScores(IGForSensors['bin5'].IG)
# plot the results
plotFeatureScores('InformationGain_bin5', 'IG score')
sensor0: 0.23280867899900937
sensor1: 0.2698171579584686
sensor2: 0.3241527933534172
sensor3: 0.149810629422372
sensor4: 0.2998660346863
sensor5: 0.14143738505578096
sensor6: 0.7101095005699908
sensor7: 0.055102413923765026
sensor8: 0.38730781000504244
sensor9: 0.10794277483157944
The results show that the IG scores are highest for sensor6 and sensor8, while sensor7 and sensor9 have the lowest scores. We can cross-check this against the individual sensor-reading plots: indeed, the readings of sensor6 and sensor8 are reasonably well separated between labels 1 and -1. This shows that information gain can be a good metric for ranking sensors.
For 10 bins
We can also rank the sensors with a different number of bins and check how the results change.
IGForSensors['bin10'].cal_IG()
# store the results to rank after sorting scores in descending order
rank['InformationGain_bin10'] = sortScores(IGForSensors['bin10'].IG)
# plot the results
plotFeatureScores('InformationGain_bin10', 'IG score')
sensor0: 0.29337656712255455
sensor1: 0.3942302164526811
sensor2: 0.379610106830096
sensor3: 0.1821675723350179
sensor4: 0.37082221149085526
sensor5: 0.21009798987609918
sensor6: 0.8330243789701728
sensor7: 0.07956458908518493
sensor8: 0.49326446490948794
sensor9: 0.12487441896598928
Alternative Methods¶
Next, let's try ranking the sensors with other methods. Here we consider methods for which dedicated libraries are available, so that their results can be compared with the IG ranking obtained above.
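As a side note (not added to the rank table), scikit-learn's mutual_info_classif estimates the mutual information between each feature and the target, which is conceptually the same quantity as IG; a minimal cross-check could look like this:
from sklearn.feature_selection import mutual_info_classif

# nearest-neighbour based mutual-information estimate per sensor
# (the scale differs from the binned IG scores above, but the ordering should be comparable)
mi = mutual_info_classif(X, y, random_state=0)
for f, s in sorted(zip(feature_names, mi), key=lambda kv: kv[1], reverse=True):
    print(f, round(s, 4))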
Decision Tree¶
First, a decision tree is considered; its built-in feature importances directly give a ranking.
from sklearn.tree import DecisionTreeRegressor

# fit a single decision tree and use its impurity-based feature importances
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X, y)
importances = tree.feature_importances_
temp = {f: s for f, s in zip(feature_names, importances)}
# store the result in rank
rank['DecisionTree'] = sortScores(temp)
for sensor, val in temp.items():
    print(sensor, round(val, 4))
plotFeatureScores('DecisionTree', 'importance score')
sensor0 0.0563
sensor1 0.0097
sensor2 0.0
sensor3 0.0
sensor4 0.0031
sensor5 0.0
sensor6 0.3432
sensor7 0.0
sensor8 0.5877
sensor9 0.0
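Random Forest¶
Similarly, a random forest regressor is fitted, and its feature importances are stored as another ranking.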
from sklearn.ensemble import RandomForestRegressor

RFR = RandomForestRegressor()
RFR.fit(X, y)
temp = {f: s for f, s in zip(feature_names, RFR.feature_importances_)}
rank['RandomForestRegressor'] = sortScores(temp)
for sensor, val in temp.items():
    print(sensor, round(val, 4))
plotFeatureScores('RandomForestRegressor', 'importance score')
sensor0 0.0354
sensor1 0.017
sensor2 0.0221
sensor3 0.0043
sensor4 0.0629
sensor5 0.0041
sensor6 0.2811
sensor7 0.0059
sensor8 0.5625
sensor9 0.0047
Collect all results¶
Now we can collect and compare the four sensor-ranking results. All ranking results are stored in the dataframe df_rank.
res = []
for method in rank.keys():
    sensor_number = ['sensor '+s[-1] for s in rank[method].keys()]
    res.append(sensor_number)
df_rank = pd.DataFrame(data=np.array(res).T, columns=rank.keys())
display(df_rank)
| | InformationGain_bin5 | InformationGain_bin10 | DecisionTree | RandomForestRegressor |
|---|---|---|---|---|
0 | sensor 6 | sensor 6 | sensor 8 | sensor 8 |
1 | sensor 8 | sensor 8 | sensor 6 | sensor 6 |
2 | sensor 2 | sensor 1 | sensor 0 | sensor 4 |
3 | sensor 4 | sensor 2 | sensor 1 | sensor 0 |
4 | sensor 1 | sensor 4 | sensor 4 | sensor 2 |
5 | sensor 0 | sensor 0 | sensor 2 | sensor 1 |
6 | sensor 3 | sensor 5 | sensor 3 | sensor 7 |
7 | sensor 5 | sensor 3 | sensor 5 | sensor 9 |
8 | sensor 9 | sensor 9 | sensor 7 | sensor 3 |
9 | sensor 7 | sensor 7 | sensor 9 | sensor 5 |
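To quantify how much the four rankings agree with each other, one could compute a rank correlation such as Kendall's tau between the 5-bin IG ranking and each of the other methods (a rough optional sketch using scipy.stats.kendalltau, assuming df_rank as built above):
from scipy.stats import kendalltau

# position of each sensor in each method's ranking (0 = most important)
positions = {m: {s: i for i, s in enumerate(df_rank[m])} for m in df_rank.columns}
sensors = list(df_rank['InformationGain_bin5'])
for method in df_rank.columns[1:]:
    tau, _ = kendalltau([positions['InformationGain_bin5'][s] for s in sensors],
                        [positions[method][s] for s in sensors])
    print(method, round(tau, 3))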
Export the result to a csv file:
df_rank.to_csv('sensor_rank_result.csv', index=True)