The story of a certain recruitment process

recruitment in data science, working in data science, recruitment process, HR, jobs, data science

During a recruitment process for a data-analysis position, it is a fairly common practice to ask the candidate to perform a sample analysis on a data set provided by the prospective employer. This post describes exactly such a situation, which I experienced myself a few years ago.

I receive a lot of questions from you about recruitment processes and looking for a job in data analysis. So far I have not covered this topic on the blog. One of those e-mails reminded me that somewhere in my computer's archive I still have files (notebooks, documents) from recruitment processes I took part in.

I will most likely never need those files again, but some of you might find them useful, so I would like to share them (I will include only data from open sources and omit anything confidential).

The story behind this recruitment process

How did the company find out about me?

I received a phone call from a recruiter. With my consent, a friend had passed my phone number to the agency that was running the recruitment on behalf of my would-be employer.

What were the position, the company and the industry?

Unfortunately I do not remember the exact name of the position I applied for. The main task of the person in that role was risk modeling. For obvious reasons I am not naming the company. It was the insurance industry, and one of the larger international corporations with an office in Warsaw.

Tasks to complete

At the last stage I reached, I was asked to perform an analysis of my choosing on the "Wine Quality" data set. The result of the analysis was to be two documents written in English:

  1. A Jupyter notebook – I present it below, in the section "Example approach to risk assessment problem".
  2. An executive summary in PDF format – the link is here.

Reading the first of the documents I prepared, you may ask: "Mateusz, why did you choose decision trees?". It is partly explained in the analysis itself. The additional reasons were:

  1. Trees gave me the opportunity to showcase my skills in parameter optimization.
  2. Like regression, trees are interpretable.
  3. Regression with the usual add-ons (binning, WoE and more advanced feature selection) performed too well on this problem. If I remember correctly, I was getting complete separation of the observations in the data set (a short WoE sketch follows this list).
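For readers who have not met Weight of Evidence (WoE) before, here is a minimal, self-contained sketch of how WoE is typically computed for one binned feature. It is an illustration only (synthetic data and hypothetical bin names), not a fragment of the recruitment analysis itself.

import numpy as np
import pandas as pd

# Weight of Evidence for a binned feature (illustration only):
# WoE(bin) = ln( share of "good" observations in the bin / share of "bad" observations in the bin )
df = pd.DataFrame({
    'feature_bin': ['low', 'low', 'mid', 'mid', 'mid', 'high', 'high', 'high'],
    'bad':         [0,     1,     0,     0,     1,     1,      1,      0],
})

stats = df.groupby('feature_bin')['bad'].agg(['count', 'sum'])
stats['good'] = stats['count'] - stats['sum']          # non-events per bin
stats['woe'] = np.log((stats['good'] / stats['good'].sum()) /
                      (stats['sum'] / stats['sum'].sum()))
print(stats[['woe']])
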

How did the recruitment process end?

After the two or three stages I passed, I was unexpectedly informed that the company had changed its requirements and was now looking for a Big Data Engineer instead. I was surprised. The process took about a month, I put many hours into it, and I did not even receive feedback on how I had done. What was the reason? Was it the insurance company's fault, or the intermediary running the process? I do not know. It does, however, show the reality of some recruitment processes.

The lessons I took from this process were the following:

  1. Right at the start it is worth agreeing on concrete terms of cooperation for the recruitment stage itself. Without that, if you do not get the position you applied for, you do not know what mistakes you are making. You do not correct them, so you grow more slowly. You do not know where the gap is that you need to fill in order to finally land the job you dream of.
  2. It is not worth taking part in processes run by intermediary agencies. This is my private, subjective opinion based on my own experience. There are several reasons, but the most important ones are:
    1. I assume that in the vast majority of cases you can negotiate better terms (better meaning better matched to your expectations) directly with the employer. I have no data to back this up, but logic suggests that the intermediary is one more party that has to be paid in order to hire a new employee.
    2. Contact with your future manager resembles a game of telephone. I ask about something, the recruiter does not know the answer and gets back to me after a few days (in the optimistic case ;-)).
    3. The recruiter may not know the specifics of the role at all, so you will not speak the same language (this was the case in two processes I took part in). My knowledge was verified quiz-style – you either know the right answer or you do not. Black or white, no shades of grey. For someone who does not work in Data Science there is only one correct answer to a question – the one the client has written down for them on a sheet of paper. For example, I was once asked a question about Linux: "How do you check the previous command typed into the terminal?". I answered that I usually just use the arrow keys. "Wrong," the recruiter replied. "The answer is the command: history."

If you take part in a recruitment process, remember that regardless of how it ends, you have the right to expect feedback afterwards. You are spending your private time, which you could use in other ways. So let it be a "win-win" deal for both sides, whatever the final outcome. 😉


A short clarification.

Please do not read this post as an attempt to lament the hard lot of an IT worker in a job market ruled by big corporations. 🙂 I think it is exactly the opposite in our industry: knowledge workers have never had it as good as they do now. The opportunities we have today are enormous, for which I am grateful to Providence. Of course, this does not mean our situation will never change. The future may bring big changes, but I wrote about that at greater length in the post: Jaki los czeka data scientistów?

To sum up: the only motivation behind this post is the desire to share knowledge and my own experience.


Example approach to risk assessment problem

Goals:

  • Analyze the data set.
  • Discover knowledge about the data set (correlations and dependencies between features).
  • Build a good-quality classifier to classify wine types (white/red).

Main assumptions:

  • This is a wine classification task, but I will approach it as if it were a risk assessment problem, to show how data analysis and modeling methods apply in the financial sector.
  • Since this is the financial sector, the model has to be interpretable due to both legal regulations and business requirements.
  • In risk assessment the predicted probability matters more than the class assigned by the predictive model, so the Gini score will be used as the basic measure of model quality (see the short sketch after this list).
  • In risk assessment false negatives are more expensive for the company than false positives, which is why recall will be used as an additional measure of model quality.
  • In financial services a predictive model has to be stable and reliable. I will measure stability in a cross-validation test using the coefficient of variation.
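Before loading the data, a minimal, self-contained sketch of the two measures mentioned above, exactly as they are computed later in this notebook: Gini derived from ROC AUC, and stability as 100% minus the coefficient of variation of the per-fold Gini scores (the AUC values below are hypothetical).

import numpy as np

cv_auc = np.array([0.97, 0.98, 0.96, 0.975, 0.985])      # hypothetical per-fold AUC values

cv_gini = 2 * cv_auc - 1                                  # Gini = 2 * AUC - 1
stability = 100 - np.std(cv_gini) * 100 / cv_gini.mean()  # 100% would mean identical folds

print('Average Gini: {:.3f}'.format(cv_gini.mean()))
print('Stability: {:.2f}%'.format(stability))
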

Data set:

1. Loading required libraries.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, roc_curve, recall_score
from sklearn.feature_selection import RFE
from scipy.stats import randint
from sklearn import tree
import graphviz

2. Loading data and summarizing the set.

2.1. Loading data and checking feature types.

In [43]:
white_wines = pd.read_csv('data/winequality-white.csv', sep = ';')
red_wines = pd.read_csv('data/winequality-red.csv', sep = ';')
In [44]:
white_wines.head()
Out[44]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
In [45]:
red_wines.head()
Out[45]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

Checking whether there are any additional footers in the data sets.

In [46]:
white_wines.tail()
Out[46]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
4893 6.2 0.21 0.29 1.6 0.039 24.0 92.0 0.99114 3.27 0.50 11.2 6
4894 6.6 0.32 0.36 8.0 0.047 57.0 168.0 0.99490 3.15 0.46 9.6 5
4895 6.5 0.24 0.19 1.2 0.041 30.0 111.0 0.99254 2.99 0.46 9.4 6
4896 5.5 0.29 0.30 1.1 0.022 20.0 110.0 0.98869 3.34 0.38 12.8 7
4897 6.0 0.21 0.38 0.8 0.020 22.0 98.0 0.98941 3.26 0.32 11.8 6
In [47]:
red_wines.tail()
Out[47]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.5 5
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 6
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 6
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 5
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 6

Merging both data sets into one, in order to analyze them together and then train and evaluate a predictive model.

In [48]:
white_wines['y'] = 0
red_wines['y'] = 1
In [49]:
wines = pd.concat([white_wines, red_wines], axis = 0)

2.2. Preview of full data set.

In [50]:
wines.head()
Out[50]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality y
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6 0
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6 0
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6 0
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6 0
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6 0

So there are only numeric features in the data set, except for the target variable.

2.3. Data set size.

In [51]:
print('Data set contains {} observations, described by {} features.'.format(wines.shape[0], wines.shape[1]))
Data set contains 6497 observations, described by 13 features.

2.4. Available features.

In [12]:
print('Available features in data set: {}'.format(list(wines.columns)))
Available features in data set: ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality', 'y']

2.5. Preview of feature types and missing values.

In [13]:
summary = pd.DataFrame(wines.dtypes, columns=['Feature type'])
summary['Is any null?'] = pd.DataFrame(wines.isnull().any())
summary['Sum of nulls'] = pd.DataFrame(wines.isnull().sum())
summary['Percent of nulls'] = round((wines.apply(pd.isnull).mean()*100),2)
summary
Out[13]:
Feature type Is any null? Sum of nulls Percent of nulls
fixed acidity float64 False 0 0.0
volatile acidity float64 False 0 0.0
citric acid float64 False 0 0.0
residual sugar float64 False 0 0.0
chlorides float64 False 0 0.0
free sulfur dioxide float64 False 0 0.0
total sulfur dioxide float64 False 0 0.0
density float64 False 0 0.0
pH float64 False 0 0.0
sulphates float64 False 0 0.0
alcohol float64 False 0 0.0
quality int64 False 0 0.0
y int64 False 0 0.0

3. Descriptive statistics.

3.1. Numeric features.

In [40]:
wines.describe()
In [14]:
wines.agg(['mean', 'median'])
Out[14]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality y
mean 7.215307 0.339666 0.318633 5.443235 0.056034 30.525319 115.744574 0.994697 3.218501 0.531268 10.491801 5.818378 0.246114
median 7.000000 0.290000 0.310000 3.000000 0.047000 29.000000 118.000000 0.994890 3.210000 0.510000 10.300000 6.000000 0.000000

There is one feature where the mean is clearly higher than the median ('residual sugar'), which may indicate skewness of its distribution. Based on these numbers I would say the distributions look quite fine, but I will make sure by checking them with plots (and with the quick skewness check right below).
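A quick numeric check of the remark above (assuming the merged 'wines' DataFrame defined in section 2): sample skewness per feature, where values far from 0 indicate skewed distributions.

wines.drop('y', axis = 1).skew().sort_values(ascending = False)
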

3.2. Categorical features.

In [15]:
wines.y.value_counts(normalize = True)
Out[15]:
0    0.753886
1    0.246114
Name: y, dtype: float64

The data set is not very balanced. It contains around 75 percent white wines and "only" around 25 percent red wines.

4. Analysis of individual features.

4.1. Target variable.

In [16]:
y_dist = wines.y.value_counts(normalize = True)
In [17]:
y_dist.index = ['white wines', 'red wines']
In [18]:
print('Distribution of target variable:')
y_dist
Distribution of target variable:
Out[18]:
white wines    0.753886
red wines      0.246114
Name: y, dtype: float64

4.2. Numeric features.

Test for normality of distribution.

Based on the function description: "This function tests the null hypothesis that a sample comes from a normal distribution." So:

  • H0 = the sample comes from a normal distribution,
  • H1 = the sample does not come from a normal distribution.
In [21]:
for feature in wines.columns.drop('y'):
    alpha = 0.05
    p_value = scipy.stats.normaltest(wines[feature])[1]
    if(p_value < alpha):
        print('For feature \'' + feature +'\' null hypothesis can be rejected. Sample DOES NOT come from normal distribution.')
    else:
        print('For feature \'' + feature +'\' null hypothesis can not be rejected. Sample COMES from normal distribution.')
For feature 'fixed acidity' null hypothesis can be rejected. Sample DOES NOT come from normal distribution.
For feature 'volatile acidity' null hypothesis can be rejected. Sample DOES NOT come from normal distribution.
For feature 'citric acid' null hypothesis can be rejected. Sample DOES NOT come from normal distribution.
For feature 'residual sugar' null hypothesis can be rejected. Sample DOES NOT come from normal distribution.
For feature 'chlorides' null hypothesis can be rejected. Sample DOES NOT come from normal distribution.
For feature 'free sulfur dioxide' null hypothesis can be rejected. Sample DOES NOT come from normal distribution.
For feature 'total sulfur dioxide' null hypothesis can be rejected. Sample DOES NOT come from normal distribution.
For feature 'density' null hypothesis can be rejected. Sample DOES NOT come from normal distribution.
For feature 'pH' null hypothesis can be rejected. Sample DOES NOT come from normal distribution.
For feature 'sulphates' null hypothesis can be rejected. Sample DOES NOT come from normal distribution.
For feature 'alcohol' null hypothesis can be rejected. Sample DOES NOT come from normal distribution.
For feature 'quality' null hypothesis can be rejected. Sample DOES NOT come from normal distribution.

We rejected the hypothesis of normality for every feature. In the next steps I cannot use most parametric statistical methods, so I will choose nonparametric ones.

Check distribution of numeric features.

The scale of the features differs quite a bit (e.g. the mean of 'volatile acidity' is about 0.33, while the mean of 'total sulfur dioxide' is about 116), so it would not be easy to show the distributions of all features in one plot. I decided to divide the features into 3 groups and then visualize the data.

In [53]:
melted_wines_df_1 = pd.melt(wines, value_vars=wines.drop(['y', 'total sulfur dioxide', 'free sulfur dioxide', 'residual sugar', 'fixed acidity', 'alcohol', 'quality'], axis = 1).columns, var_name=['feature_name'], value_name = 'value')
melted_wines_df_2 = pd.melt(wines, value_vars=wines[['fixed acidity', 'alcohol', 'quality']].columns, var_name=['feature_name'], value_name = 'value')
melted_wines_df_3 = pd.melt(wines, value_vars=wines[['total sulfur dioxide', 'free sulfur dioxide', 'residual sugar', 'fixed acidity', 'alcohol', 'quality']].columns, var_name=['feature_name'], value_name = 'value')
In [54]:
plt.figure(figsize=(8,5))
sns.set(font_scale=1.4)
sns.boxplot(data = melted_wines_df_1, y = 'feature_name', x = 'value', palette = 'Blues_d').set(title = 'Distribution of numeric features - group 1', ylabel = 'feature name')
plt.show()
In [24]:
plt.figure(figsize=(13,5))
sns.set(font_scale=1.4)
sns.boxplot(data = melted_wines_df_2, y = 'feature_name', x = 'value', palette = 'Blues_d').set(title = 'Distribution of numeric features - group 2', ylabel = 'feature name')
plt.show()
In [25]:
plt.figure(figsize=(13,5))
sns.set(font_scale=1.4)
sns.boxplot(data = melted_wines_df_3, y = 'feature_name', x = 'value', palette = 'Blues_d').set(title = 'Distribution of numeric features - group 3', ylabel = 'feature name')
plt.show()

On the plots the small black dots are outliers. If I were going to use linear algorithms, it would be important to detect and remove these outliers.

In addition to the lack of normality, such a large number of outliers is another argument for using nonparametric statistics and an algorithm that is not sensitive to outliers (e.g. a decision tree).

Outliers detection.

There are two common methods of detecting outliers: the z-score rule (flagging values far from the mean in terms of standard deviations) and the IQR rule (flagging values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]).

The first of them has a strong assumption: normally distributed features. I cannot use it, so I decided to use the second (the z-score variant is sketched below only for reference).
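For reference, a minimal sketch of the z-score variant I am not using here: it flags values lying more than 3 standard deviations from the mean and only makes sense for roughly normal features (it assumes the 'wines' frame from above).

features = wines.drop(['y', 'quality'], axis = 1)
z_scores = (features - features.mean()) / features.std()
(z_scores.abs() > 3).sum()   # number of flagged values per feature
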

In [26]:
q1 = wines.drop(['y', 'quality'], axis = 1).quantile(0.25)
q3 = wines.drop(['y', 'quality'], axis = 1).quantile(0.75)
iqr = q3 - q1
In [27]:
low_boundary = (q1 - 1.5 * iqr)
upp_boundary = (q3 + 1.5 * iqr)
num_of_outliers_L = (wines.drop(['y', 'quality'], axis = 1)[iqr.index] < low_boundary).sum()
num_of_outliers_U = (wines.drop(['y', 'quality'], axis = 1)[iqr.index] > upp_boundary).sum()
outliers = pd.DataFrame({'lower_boundary':low_boundary, 'upper_boundary':upp_boundary,'num_of_outliers_L':num_of_outliers_L, 'num_of_outliers_U':num_of_outliers_U})
In [28]:
outliers
Out[28]:
lower_boundary num_of_outliers_L num_of_outliers_U upper_boundary
fixed acidity 4.450000 7 350 9.650000
volatile acidity -0.025000 0 377 0.655000
citric acid 0.040000 279 230 0.600000
residual sugar -7.650000 0 118 17.550000
chlorides -0.002500 0 286 0.105500
free sulfur dioxide -19.000000 0 62 77.000000
total sulfur dioxide -41.500000 0 10 274.500000
density 0.985365 0 3 1.003965
pH 2.795000 7 66 3.635000
sulphates 0.175000 0 191 0.855000
alcohol 6.800000 0 3 14.000000

I will try removing the outliers and then check the size of the data set.

In [29]:
w = wines.copy()
In [30]:
for row in outliers.iterrows():
    w = w[(w[row[0]] >= row[1]['lower_boundary']) & (w[row[0]] <= row[1]['upper_boundary'])]
In [31]:
w.shape
Out[31]:
(5024, 13)
In [32]:
wines.shape
Out[32]:
(6497, 13)

If I used this method, I would have to remove about 1500 observations. I do not want to do that, so I decided to use a decision tree to build the predictive model.

4.3. Categorical features.

-

5. Analysis of dependencies and correlation between variables.

5.1. Analysis of correlation between target variable and other features.

  1. I cannot use Pearson's correlation, because:
    • it assumes normally distributed variables,
    • the target variable is of a categorical type,
    • there are outliers in the data set that can disturb the analysis.
  2. I will use Spearman's rank correlation, because:
    • it makes no assumption about the distribution of the variables,
    • the target variable is categorical but dichotomous, so ranking it still makes sense (as long as I want to measure the strength of the dependency, not its direction),
    • it is based on ranks rather than raw values, so this method is resistant to outliers (see the small illustration after this list).
  3. I will use only absolute values to assess the strength of the relationships.
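A small illustration of the outlier-resistance argument (purely synthetic data, not part of the wine analysis): one extreme value pulls Pearson's r down, while Spearman's rho, being rank-based, still sees a perfectly monotonic relationship.

a = pd.Series(range(1, 11))
b = pd.Series(list(range(1, 10)) + [1000])
print('Pearson: {:.3f}'.format(a.corr(b, method = 'pearson')))    # about 0.53
print('Spearman: {:.3f}'.format(a.corr(b, method = 'spearman')))  # 1.000
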
In [33]:
correlation_matrix = pd.DataFrame(np.abs(scipy.stats.spearmanr(wines)[0]), columns = wines.columns, index = wines.columns)
correlation_matrix.drop('y', axis = 0, inplace = True)
correlation_matrix.reset_index(inplace=True)
In [34]:
plt.figure(figsize=(10,7))
sns.set(font_scale=1.4)
sns.barplot(data = correlation_matrix.sort_values('y', ascending=False), x = 'y', y = 'index', palette = 'Blues_d').set(title = 'Coefficient of correlation between target variable and other features', xlabel = 'coefficient of correlation', ylabel = 'feature name')
plt.show()

Based on the plot above I suspect that some variables ('total sulfur dioxide', 'chlorides', 'volatile acidity', 'free sulfur dioxide') have high predictive power, as their coefficient of correlation with the target variable is above 0.5.

5.2. Analysis of correlation between features.

In [35]:
correlation_matrix = pd.DataFrame(np.abs(scipy.stats.spearmanr(wines.drop('y', axis = 1))[0]), columns = wines.drop('y', axis = 1).columns, index = wines.drop('y', axis = 1).columns)
In [36]:
correlation_matrix
Out[36]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
fixed acidity 1.000000 0.200272 0.270568 0.032254 0.355964 0.259914 0.233259 0.434056 0.250044 0.220145 0.110650 0.098154
volatile acidity 0.200272 1.000000 0.295129 0.064384 0.415896 0.365673 0.343534 0.261437 0.194876 0.255042 0.023924 0.257806
citric acid 0.270568 0.295129 1.000000 0.074920 0.074084 0.122058 0.159355 0.065690 0.285905 0.036898 0.019653 0.105711
residual sugar 0.032254 0.064384 0.074920 1.000000 0.035800 0.387750 0.454886 0.526664 0.229344 0.138157 0.329218 0.016891
chlorides 0.355964 0.415896 0.074084 0.035800 1.000000 0.260421 0.268434 0.590729 0.163528 0.370450 0.401270 0.295054
free sulfur dioxide 0.259914 0.365673 0.122058 0.387750 0.260421 1.000000 0.741438 0.005841 0.164699 0.221062 0.186046 0.086865
total sulfur dioxide 0.233259 0.343534 0.159355 0.454886 0.268434 0.741438 1.000000 0.061540 0.242719 0.256745 0.308982 0.054777
density 0.434056 0.261437 0.065690 0.526664 0.590729 0.005841 0.061540 1.000000 0.011777 0.274792 0.699442 0.322806
pH 0.250044 0.194876 0.285905 0.229344 0.163528 0.164699 0.242719 0.011777 1.000000 0.254263 0.140225 0.032538
sulphates 0.220145 0.255042 0.036898 0.138157 0.370450 0.221062 0.256745 0.274792 0.254263 1.000000 0.004583 0.029831
alcohol 0.110650 0.023924 0.019653 0.329218 0.401270 0.186046 0.308982 0.699442 0.140225 0.004583 1.000000 0.446925
quality 0.098154 0.257806 0.105711 0.016891 0.295054 0.086865 0.054777 0.322806 0.032538 0.029831 0.446925 1.000000
In [37]:
plt.figure(figsize=(15,6))
sns.set(font_scale=1)
sns.heatmap(correlation_matrix.abs(), cmap="Blues", linewidths=.5).set(title='Heatmap of coefficients of correlation between features')
plt.show()

Correlation between numeric features is quite strong in some cases:

  • 'free sulfur dioxide' and 'total sulfur dioxide' - based on their names that is what I expected :),
  • 'density' and 'residual sugar',
  • 'density' and 'chlorides',
  • 'density' and 'alcohol'.

I will take a slightly closer look at those correlations using contingency tables.

In [38]:
# Pair number 1.
ct_1 = pd.crosstab(wines['free sulfur dioxide'] > wines['free sulfur dioxide'].mean(), wines['total sulfur dioxide'] > wines['total sulfur dioxide'].mean())
ct_1.index = ["'free sulfur dioxide' below mean", "'free sulfur dioxide' above mean"]
ct_1.columns = ["'total sulfur dioxide' below mean", "'total sulfur dioxide' above mean"]

# Pair number 2.
ct_2 = pd.crosstab(wines['density'] > wines['density'].mean(), wines['residual sugar'] > wines['residual sugar'].mean())
ct_2.index = ["'density' below mean", "'density' above mean"]
ct_2.columns = ["'residual sugar' below mean", "'residual sugar' above mean"]

# Pair number 3.
ct_3 = pd.crosstab(wines['density'] > wines['density'].mean(), wines['chlorides'] > wines['chlorides'].mean())
ct_3.index = ["'density' below mean", "'density' above mean"]
ct_3.columns = ["'chlorides' below mean", "'chlorides' above mean"]

# Pair number 4. 
ct_4 = pd.crosstab(wines['density'] > wines['density'].mean(), wines['alcohol'] > wines['alcohol'].mean())
ct_4.index = ["'density' below mean", "'density' above mean"]
ct_4.columns = ["'alcohol' below mean", "'alcohol' above mean"]
In [39]:
ct_1
Out[39]:
'total sulfur dioxide' below mean 'total sulfur dioxide' above mean
'free sulfur dioxide' below mean 2540 968
'free sulfur dioxide' above mean 567 2422
In [40]:
ct_2
Out[40]:
'residual sugar' below mean 'residual sugar' above mean
'density' below mean 2518 593
'density' above mean 1522 1864
In [41]:
ct_3
Out[41]:
'chlorides' below mean 'chlorides' above mean
'density' below mean 2768 343
'density' above mean 1672 1714
In [42]:
ct_4
Out[42]:
'alcohol' below mean 'alcohol' above mean
'density' below mean 841 2270
'density' above mean 2650 736

In all cases a large disparity is visible. If there were little correlation, the counts in all four quadrants would be almost equal. A chi-square test of independence (sketched below) can quantify this.
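A sketch of that check for the first table, using scipy.stats.chi2_contingency (scipy is already imported at the top of the notebook):

chi2, p_value, dof, expected = scipy.stats.chi2_contingency(ct_1)
print('Chi-square statistic: {:.1f}, p-value: {:.2e}'.format(chi2, p_value))
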

5.3. Visualisation of the data set in two-dimensional space.

In [43]:
pca_model_2d = PCA(n_components = 2)
pca_dataset_2d = pd.DataFrame(pca_model_2d.fit_transform(wines.drop('y', axis = 1)), columns = ['component_1', 'component_2'])
pca_dataset_2d['y'] = wines.y.replace([0, 1], ['white', 'red']).values
In [44]:
sns.set(font_scale=1.3)
sns.lmplot(data = pca_dataset_2d, x = 'component_1', y = 'component_2', hue = 'y', fit_reg=False, size=6, aspect=1.2).set(title = 'Visualisation of the data set on 2D space')
plt.show()

We can see that both subsets lie pretty close to each other. From this perspective the red wines seem to be a "subset" of the white wines. A few outliers are also visible.

Both subsets overlap, so I will plot them again with a little bit of transparency.

In [45]:
sns.set(font_scale=1.3)
sns.lmplot(data = pca_dataset_2d, x = 'component_1', y = 'component_2', hue = 'y', fit_reg=False, scatter_kws={'alpha':0.1}, size=6, aspect=1.2).set(title = 'Visualisation of the data set on 2D space')
plt.show()

Now we can see that there is some difference between the two.

6. Dataset preparation.

There are no categorical variables (except the target, which is already binary), so there is no need to encode the data (a reference sketch of one-hot encoding follows below).
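Not needed for this data set, but for reference: if a categorical predictor were present, a one-hot encoding step with pandas could look like this (synthetic frame and column name, not part of the wine data).

demo = pd.DataFrame({'colour': ['red', 'white', 'white']})
pd.get_dummies(demo, columns = ['colour'])
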

Splitting the data set.

The data will be split into two sets, training and test, using stratification – both sets have to have the same distribution of the target variable.

In [14]:
print("Distribution of target variable in 'wines' data set:")
wines.y.value_counts(normalize=True)
Distribution of target variable in 'wines' data set:
Out[14]:
0    0.753886
1    0.246114
Name: y, dtype: float64

I am copying the target variable into the 'y' series and then dropping it from the 'wines' data set. As the random state I use today's date (5th of June, 2018).

In [15]:
y = wines.y
wines.drop('y', axis = 1, inplace = True)
In [16]:
x_tr, x_te, y_tr, y_te = train_test_split(wines, y, test_size = 0.2, random_state = 5062018, stratify = y)
In [17]:
print("Distribution of datasets in both data set after splitting. \n \n")
print('Training data set:')
print(y_tr.value_counts(normalize=True))
print('')
print('Test data set:')
print(y_te.value_counts(normalize=True))
Distribution of the target variable in both data sets after splitting. 
 

Training data set:
0    0.753896
1    0.246104
Name: y, dtype: float64

Test data set:
0    0.753846
1    0.246154
Name: y, dtype: float64

Distributions are quite similar, so now I can start building predictive models.

Approach to building and evaluating models.

I will not use the test data set (x_te, y_te) for feature selection or parameter selection. I am going to use it only twice:

  1. At the beginning - to check the quality of my 'benchmark' model. I will compare this result with the result achieved by my final model.
  2. At the very end - to simulate a real case and check my final model on data it has never seen before.

To select variables and tune model hyperparameters I will use cross-validation.

As I mentioned, I will use a decision tree classifier for three main reasons:

  • it has no assumptions about distribution of the data,
  • it can be easily interpretable,
  • it is not sensitive to outliers (see the small illustration after this list).
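A small illustration of the last point (synthetic data, not part of the analysis): a decision tree's splits depend only on the ordering of the values, so a strictly monotone transformation - which can stretch some observations far away from the rest, much like outliers - does not change its predictions.

rng = np.random.RandomState(0)
x_demo = rng.normal(size = (200, 1))
y_demo = (x_demo[:, 0] > 0).astype(int)

t1 = DecisionTreeClassifier(max_depth = 2, random_state = 0).fit(x_demo, y_demo)
t2 = DecisionTreeClassifier(max_depth = 2, random_state = 0).fit(np.exp(5 * x_demo), y_demo)
print((t1.predict(x_demo) == t2.predict(np.exp(5 * x_demo))).all())   # True
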

8.1. Model 1.

Description:

  • Method of feature selection: none.
  • Method of algorithm parameters selection: none.

This model is going to be my benchmark. By building and scoring it, I will have a baseline to compare my next models against.

To automate my work just a little bit, I will use two functions that I have written earlier.
In [18]:
def cv_and_score_model(model, x, y, prints):
    """Model scorer. Calculates: average gini score, average recall and stability.

    Parameters:
    -----------
    model: sklearn predictive model, model that will be scored
    
    x : pandas DataFrame, set of x-features

    y : pandas Series, target feature

    prints : bool, whether to print the results
    """
    
    cv_auc = cross_val_score(model, x, y, cv = 10, scoring = 'roc_auc')
    cv_recall = cross_val_score(model, x, y, cv = 10, scoring = 'recall')
    
    # Calculate Gini score based on AUC.
    cv_gini = (cv_auc * 2) - 1 
    
    # Printing results.
    if prints:
        print('Average Gini: {}.'.format(np.mean(cv_gini).round(3)))
        print('Average recall: {}.'.format(np.mean(cv_recall).round(3)))
        print('Stability: {}%'.format((100 - np.std(cv_gini)*100/cv_gini.mean()).round(3)))

    return cv_gini
In [19]:
def test_model(model, features, plots):
    """Model scorer. Calculates: average gini score, average recall and stability.

    Parameters:
    -----------
    model: sklearn predictive model, model that will be tested
    
    features : list-like, names of the features used for prediction

    plots : bool, decision whether to print plots

    Note: uses the global train/test splits (x_tr, y_tr, x_te, y_te) defined above.
    """
    
    model.fit(x_tr[features], y_tr)
    pred_prob = model.predict_proba(x_te[features])
    y_pred = model.predict(x_te[features])
    
    # convert 2D list to list of probabilities
    prob = []
    for n in pred_prob:
        prob.append(n[1])
    gini_score = (2* roc_auc_score(y_te, prob))-1
    
    # calculate FPR and TPR
    fpr, tpr, thresholds = roc_curve(y_te, prob)
    
    print('Gini score: {}.'.format(gini_score.round(3)))
    print('Recall score: {}.'.format(recall_score(y_te, y_pred).round(3)))
    
    if plots == True:
        # ROC CURVE
        plt.figure()
        lw = 2
        plt.plot(fpr, tpr, color='blue',
                 lw=lw, label='ROC curve (area under curve = %0.3f)' % roc_auc_score(y_te, prob))
        plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('ROC curve')
        plt.legend(loc="lower right")
        plt.show()

        # DENSITY PLOT
        pre = pd.DataFrame(prob, columns = ['prob'])
        pre['pred_class'] = 1
        pre.loc[pre.prob < 0.5,'pred_class'] = 0
        pre['real_class'] = list(y_te)

        sns.distplot(pre[pre.real_class == 0]['prob'], label = 'negatives').set(title = 'Density plot', xlabel = 'probability')
        sns.distplot(pre[pre.real_class == 1]['prob'], label = 'positives').set(title = 'Density plot', xlabel = 'probability')
        plt.legend()
        plt.show()
In [52]:
model_1 = DecisionTreeClassifier()

Model 1 - results.

In [55]:
gini = cv_and_score_model(model_1, x_tr, y_tr, True)
Average Gini: 0.96.
Average recall: 0.968.
Stability: 98.172%
In [56]:
test_model(model_1, x_tr.columns, False)
Gini score: 0.972.
Recall score: 0.981.

8.2. Model 2.

Description:

  • Methods of feature selection:
    • Feature importances (using a decision tree model),
    • Forward selection,
    • Backward selection,
    • RFE (recursive feature elimination).
  • Method of algorithm parameters selection: none.

Again, scikit-learn does not provide ready-made forward and backward selection routines, so I will use my own functions 🙂

In [20]:
def forward_selection(model, x, y):
    """Forward selection method.

    Parameters:
    -----------
    model : sklearn model, that will be used in features selection 
    
    x : pandas DataFrame with all possible predictors
    
    y : pandas Series with target variable

    Returns:
    --------
    best_features: "optimal" set of features selected by forward selection evaluated by Gini score
    """
    
    
    remaining = set(x.columns)
    best_features = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            gini_score = cv_and_score_model(model, x[best_features+[candidate]], y, False)
            scores_with_candidates.append((gini_score.mean(), candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            best_features.append(best_candidate)
            current_score = best_new_score
    return best_features
In [21]:
def backward_selection(model, x, y):
    """Backward selection method.

    Parameters:
    -----------
    model : sklearn model, that will be used in features selection 
    
    x : pandas DataFrame with all possible predictors
    
    y : pandas Series with target variable

    Returns:
    --------
    best_features: "optimal" set of features selected by backward selection evaluated by Gini score
    """
    
    
    remaining = set(x.columns)
    features_to_remove = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            gini_score = cv_and_score_model(model, x[list(remaining)].drop([candidate], axis = 1), y, False)
            scores_with_candidates.append((gini_score.mean(), candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            features_to_remove.append(best_candidate)
            current_score = best_new_score
    best_features = list(x.drop(features_to_remove, axis = 1).columns)
    return best_features

New model.

In [102]:
model_2 = DecisionTreeClassifier()
model_2.fit(x_tr, y_tr)
Out[102]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Feature importances - using the decision tree model.

In [103]:
pd.Series(model_2.feature_importances_, index = x_tr.columns).sort_values(ascending = False)
Out[103]:
total sulfur dioxide    0.666565
chlorides               0.207824
volatile acidity        0.048527
sulphates               0.021964
density                 0.019074
pH                      0.012296
alcohol                 0.008947
residual sugar          0.007544
free sulfur dioxide     0.003401
fixed acidity           0.002600
quality                 0.000778
citric acid             0.000480
dtype: float64

The feature importances suggest keeping all of the features, as each of them is nonzero (a quick visualization follows below). Let's check this using other selection methods.
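Optionally, the same importances can be checked visually, using the plotting libraries already imported above:

fi = pd.Series(model_2.feature_importances_, index = x_tr.columns).sort_values()
fi.plot(kind = 'barh', figsize = (8, 5), title = 'Feature importances - model 2')
plt.show()
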

Forward selection.

In [104]:
selected_features_fs = forward_selection(model_2, x_tr, y_tr)
In [105]:
print('Number of selected features: {}.'.format(len(selected_features_fs)))
Number of selected features: 5.
In [106]:
print('Selected features: {}.'.format(list(selected_features_fs)))
Selected features: ['total sulfur dioxide', 'chlorides', 'volatile acidity', 'density', 'residual sugar'].

Cross validation of a model:

In [107]:
gini = cv_and_score_model(model_2, x_tr[selected_features_fs], y_tr, True)
Average Gini: 0.963.
Average recall: 0.969.
Stability: 98.347%

Backward selection.

In [108]:
selected_features_bs = backward_selection(model_2, x_tr, y_tr)
In [109]:
print('Number of selected features: {}.'.format(len(selected_features_bs)))
Number of selected features: 10.
In [110]:
print('Selected features: {}.'.format(list(selected_features_bs)))
Selected features: ['fixed acidity', 'volatile acidity', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'alcohol', 'quality'].

Cross validation of a model:

In [111]:
gini = cv_and_score_model(model_2, x_tr[selected_features_bs], y_tr, True)
Average Gini: 0.963.
Average recall: 0.971.
Stability: 98.858%

Recursive feature elimination.

In [112]:
for n in np.arange(1, 13, 1):
    selector = RFE(model_2, n_features_to_select = n, step = 1)
    print('=================================')
    print('Model with {} features.'.format(n))
    print('=================================')
    cv_and_score_model(model_2, x_tr.iloc[:,selector.fit(x_tr, y_tr).support_], y_tr, True)
    print('')
=================================
Model with 1 features.
=================================
Average Gini: 0.897.
Average recall: 0.777.
Stability: 97.705%

=================================
Model with 2 features.
=================================
Average Gini: 0.926.
Average recall: 0.94.
Stability: 97.714%

=================================
Model with 3 features.
=================================
Average Gini: 0.948.
Average recall: 0.961.
Stability: 98.215%

=================================
Model with 4 features.
=================================
Average Gini: 0.958.
Average recall: 0.965.
Stability: 98.299%

=================================
Model with 5 features.
=================================
Average Gini: 0.959.
Average recall: 0.97.
Stability: 98.246%

=================================
Model with 6 features.
=================================
Average Gini: 0.96.
Average recall: 0.968.
Stability: 98.081%

=================================
Model with 7 features.
=================================
Average Gini: 0.959.
Average recall: 0.967.
Stability: 97.862%

=================================
Model with 8 features.
=================================
Average Gini: 0.962.
Average recall: 0.968.
Stability: 98.125%

=================================
Model with 9 features.
=================================
Average Gini: 0.96.
Average recall: 0.971.
Stability: 98.042%

=================================
Model with 10 features.
=================================
Average Gini: 0.956.
Average recall: 0.968.
Stability: 97.97%

=================================
Model with 11 features.
=================================
Average Gini: 0.959.
Average recall: 0.967.
Stability: 97.909%

=================================
Model with 12 features.
=================================
Average Gini: 0.957.
Average recall: 0.97.
Stability: 98.185%

In [113]:
selector = RFE(model_2, n_features_to_select = 8, step = 1)
selected_features_rfe = x_tr.iloc[:,selector.fit(x_tr, y_tr).support_].columns
In [114]:
print('Number of selected features: {}.'.format(len(selected_features_rfe)))
Number of selected features: 8.
In [115]:
print('Selected features: {}.'.format(list(selected_features_rfe)))
Selected features: ['volatile acidity', 'residual sugar', 'chlorides', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol'].
In [116]:
gini = cv_and_score_model(model_2, x_tr[selected_features_rfe], y_tr, True)
Average Gini: 0.958.
Average recall: 0.973.
Stability: 98.069%

Of the methods described above, backward selection achieved the best result (tied for the best Gini, with the highest stability), so I am going to use it as my main feature selection method and build the next model using the features it chose.

8.3. Model 3.

Description:

  • Method of feature selection: Backward selection.
  • Method of algorithm parameters selection:
    • RandomizedSearchCV,
    • GridSearchCV.

GridSearchCV.

In [117]:
parameters = {'criterion':('entropy', 'gini'), 'splitter':('best','random'), 'max_depth':np.arange(1, 6), 'min_samples_split':np.arange(2,10), 'min_samples_leaf':np.arange(1,5)}
In [118]:
grid_search = GridSearchCV(DecisionTreeClassifier(), parameters, cv=10)
In [119]:
grid_search.fit(x_tr[selected_features_bs], y_tr)
Out[119]:
GridSearchCV(cv=10, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'criterion': ('entropy', 'gini'), 'splitter': ('best', 'random'), 'max_depth': array([1, 2, 3, 4, 5]), 'min_samples_split': array([2, 3, 4, 5, 6, 7, 8, 9]), 'min_samples_leaf': array([1, 2, 3, 4])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
In [120]:
best_parameters_gs = grid_search.best_params_
In [121]:
print('Best parameters chosen by GridSearchCV: {}'.format(best_parameters_gs))
Best parameters chosen by GridSearchCV: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 4, 'splitter': 'best'}

Cross validation of a model:

In [122]:
model_3 = DecisionTreeClassifier(**best_parameters_gs)
In [124]:
gini = cv_and_score_model(model_3, x_tr[selected_features_bs], y_tr, True)
Average Gini: 0.974.
Average recall: 0.962.
Stability: 98.727%

RandomizedSearchCV.

In [127]:
params_rs = {'criterion':('entropy', 'gini'),
'splitter':('best','random'),
'max_depth':randint(1,6),
'min_samples_split':randint(3,8),
'min_samples_leaf':randint(1,5)}
In [128]:
rs = RandomizedSearchCV(model_3, cv = 10, n_iter = 20, param_distributions = params_rs)
In [132]:
rs.fit(x_tr[selected_features_bs], y_tr)
Out[132]:
RandomizedSearchCV(cv=10, error_score='raise',
          estimator=DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=4,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          fit_params=None, iid=True, n_iter=20, n_jobs=1,
          param_distributions={'criterion': ('entropy', 'gini'), 'splitter': ('best', 'random'), 'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000012369D756A0>, 'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001236DC4B9B0>, 'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000012375669BE0>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=0)
In [133]:
best_parameters_rs = rs.best_params_
In [134]:
print('Best parameters chosen by RandomizedSearchCV: {}'.format(best_parameters_rs))
Best parameters chosen by RandomizedSearchCV: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 3, 'min_samples_split': 3, 'splitter': 'best'}

Cross validation of a model:

In [135]:
model_3 = DecisionTreeClassifier(**best_parameters_rs)
In [137]:
gini = cv_and_score_model(model_3, x_tr[selected_features_bs], y_tr, True)
Average Gini: 0.978.
Average recall: 0.959.
Stability: 99.234%

Looking at Gini and stability, RandomizedSearchCV turned out slightly better than GridSearchCV (recall is marginally lower). I am going to use the parameters chosen by RS.

This could be the end of the modeling, but let's take a look at the feature importances:

Feature importance of model 3.

In [139]:
# Fit on the selected features so that feature_importances_ is available.
model_3.fit(x_tr[selected_features_bs], y_tr)
pd.Series(model_3.feature_importances_, index = selected_features_bs).sort_values(ascending = False)
Out[139]:
chlorides               0.614701
total sulfur dioxide    0.271649
volatile acidity        0.049701
density                 0.036208
pH                      0.023065
fixed acidity           0.004676
quality                 0.000000
alcohol                 0.000000
free sulfur dioxide     0.000000
residual sugar          0.000000
dtype: float64

As 4 of the features turned out to have zero importance, I will make one more iteration of modeling, using 2 methods of feature selection.

8.4. Model 4.

Description:

  • Method of feature selection:
    • backward selection,
    • forward selection.
  • Method of algorithm parameters selection: RandomizedSearchCV.

Backward selection.

In [142]:
selected_features_bs_2 = backward_selection(model_3, x_tr[selected_features_bs], y_tr)
In [145]:
gini = cv_and_score_model(model_3, x_tr[selected_features_bs_2], y_tr, True)
Average Gini: 0.985.
Average recall: 0.953.
Stability: 99.553%

Forward selection.

In [147]:
selected_features_fs_2 = forward_selection(model_3, x_tr[selected_features_bs], y_tr)
In [148]:
gini = cv_and_score_model(model_3, x_tr[selected_features_fs_2], y_tr, True)
Average Gini: 0.989.
Average recall: 0.959.
Stability: 99.189%

Forward selection is the winner here: better Gini and better recall, at the cost of slightly lower stability.

9. Final test.

In [156]:
test_model(model_3, selected_features_fs_2, False)
Gini score: 0.983.
Recall score: 0.962.

10. Model visualisation.

10.1. Decision tree.

In [159]:
dot_data = tree.export_graphviz(model_3, out_file=None, 
                         feature_names=selected_features_fs_2,  
                         class_names=['0','1'],  
                         filled=True, rounded=True,  
                         special_characters=True)
graph = graphviz.Source(dot_data)
In [160]:
out = dot_data[0:14] + 'ranksep=.75; size = "20,30";' + dot_data[14:]
In [ ]:
graph = graphviz.Source(out)
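The graph object is created above but not rendered in this text export. Assuming the Graphviz binaries are installed, one way to materialize it is to render it to a file (the file name 'wine_tree' is arbitrary here):

graph.render('wine_tree')   # writes wine_tree.pdf next to the notebook
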

10.2. Sample prediction.

In [204]:
samples = x_te[selected_features_fs_2].sample(5).sort_index()
In [205]:
samples
Out[205]:
chlorides total sulfur dioxide density fixed acidity alcohol
521 0.088 43.0 0.99800 7.6 9.1
727 0.057 101.0 0.99540 7.0 9.4
1248 0.081 36.0 0.99490 6.9 11.1
2600 0.042 132.0 0.99059 6.8 11.3
4355 0.039 137.0 0.98946 6.4 12.7
In [206]:
true_y = y_te[y_te.index.isin(samples.index)].sort_index()  # keep the sorted true labels (an inplace sort on a slice would be a no-op)
In [207]:
pd.DataFrame({'true_y' : true_y, 'predicted_y' : model_3.predict(samples)}, index = samples.index)
Out[207]:
predicted_y true_y
521 1 1
727 0 0
1248 1 1
2600 0 0
4355 0 0

photo: pixabay.com


Comments

  1. Thanks for an interesting article. In banking or insurance you often cannot get hired any other way than through an agency, at least in my experience. It usually works out to everyone's loss except the agency's.

    • Adam, it depends which backward selection we are talking about. Different implementations work in different ways. In general it comes down to the stopping condition – the RFE implementation in sklearn assumes up front the number of features we want to end up with.

  2. The main (generalized) conclusion: avoid/skip/cut out intermediaries whenever you can 😉

    Not only in recruitment, but in every area – they only consume resources and energy that we could use ourselves. (Unless the benefits of using intermediaries outweigh the potential losses – for us.)

    The 'history' story on Linux is a very good example – when an intermediary (a non-expert or simply a layperson in a given topic) talks to someone at a higher level of expertise, the lack of a common language, knowledge and experience often makes agreement and understanding impossible.

    Using intermediaries shortens the recruitment time (and therefore reduces its cost), but when recruiting for expert positions the gains may be smaller than the potential losses (e.g. the candidate who, from the hiring company's point of view, would be the best fit for the role may not make it through the process).

    PS. Thanks for the large amount of material on the blog, I will definitely read through all of it 😉

  3. Great post, I learned a lot, thanks!

    I question the usefulness of the normality test. The test showed that the variables do not have a perfectly normal distribution (and we are sure of that because of the very small p-values). But we know that a priori, because no real-world data is perfectly normally distributed.

    Counterintuitively, the test does *not* tell us how much the data deviates from a normal distribution, and its result does *not* disqualify techniques that formally assume normality (because those are quite tolerant).

    Traditional normality tests seem useless in an ML context, because for a large sample of real data (here ~6500) they will almost always reject the null hypothesis. In practice we do not need perfectly normal distributions to apply popular techniques and tools. There is a nice thread on Cross Validated about this.
