Predicting NBA Careers From Rookie Performance, Again... - 白红宇

Published: 2019-05-11

Now that the NBA season is done, we have complete data from this year’s NBA rookies. In the past I have tried to predict NBA rookies’ future performance using regression models. In this post I am again trying to predict rookies’ future performance, but now using a classification approach. With a classification approach, I predict whether player X will be a “great,” “average,” or “poor” player rather than predicting exactly how productive player X will be.

Much of this post re-uses code from the previous posts, so I skim over some of the repeated code.

As usual, I will post all code as a Jupyter notebook.

Load the data. As a reminder, this data is publicly available.

Load more data, and normalize it for the dimensionality-reduction step (the ScaleModel and reduced_model objects used in the prediction function further down).

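
As a rough illustration of a normalization step like this (not the original notebook code), the sketch below scales made-up stats and then reduces them. The names ScaleModel and reduced_model match the objects used later in this post. The choice of StandardScaler and PCA, the number of components, and the data itself are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up stand-in for the career-stats matrix: 200 players x 6 stats
# on very different scales (e.g. percentages vs. per-game counts).
stats = rng.normal(loc=[0.45, 10, 4, 3, 1, 15],
                   scale=[0.05, 6, 2, 2, 0.5, 5],
                   size=(200, 6))

# Scale every column to zero mean and unit variance.
ScaleModel = StandardScaler().fit(stats)
scaled = ScaleModel.transform(stats)

# Assumption: PCA as the reducer, keeping 3 components.
reduced_model = PCA(n_components=3).fit(scaled)
reduced_data = reduced_model.transform(scaled)

print(scaled.mean(axis=0).round(6))
print(reduced_data.shape)
```

Keeping the fitted scaler and reducer around matters because new players have to pass through exactly the same transformations before they can be grouped.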

In the past I used k-means to group players according to their performance (see my earlier post for more info). Here, I use a Gaussian mixture model (GMM) to group the players. I use the GMM because it assigns each player a “soft” label rather than a “hard” label. By soft label I mean that a player simultaneously belongs to several groups. For instance, Russell Westbrook belongs to both my “point guard” group and my “scorers” group. K-means uses hard labels where each player can only belong to one group. I think the GMM provides a more accurate representation of players, so I’ve decided to use it in this post. Maybe in a future post I will spend more time describing it.

For anyone wondering, the GMM groupings looked pretty similar to the k-means groupings.

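
To make the soft versus hard label distinction concrete, here is a minimal sketch on made-up 2-D data (not the original notebook code). The names g and new_labels mirror the objects used later in this post. The number of components and the data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Made-up 2-D "player" data: two overlapping blobs.
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(150, 2)),
    rng.normal(loc=[2.5, 0.0], scale=1.0, size=(150, 2)),
])

# GMM: every player gets a membership probability for every group.
g = GaussianMixture(n_components=2, random_state=0).fit(data)
soft_labels = g.predict_proba(data)
new_labels = g.predict(data)  # most likely group, for comparison

# K-means: every player gets exactly one group.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

print(soft_labels[:3].round(2))
print(hard_labels[:3])
```

Players sitting between the two blobs end up with memberships split across both groups under the GMM, while k-means has to pick one.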

In the past I have attempted to predict win shares per 48 minutes. I am using win shares as a dependent variable again, but this time I want to categorize players.

Below I create a histogram of players’ win shares per 48.

I split players into 4 groups, which I will refer to as “poor,” “below average,” “above average,” and “great”: poor players are the bottom 10% in win shares per 48, below average are the 10th-50th percentiles, above average are the 50th-90th percentiles, and great are the top 10%. This assignment scheme is relatively arbitrary; the model performs similarly with different assignment schemes.

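
One way to build such categories is to cut win shares per 48 at the 10th, 50th and 90th percentiles. The sketch below does this on made-up values (not the original notebook code). The column names 'WS/48' and 'perf_cat' appear elsewhere in this post. The integer coding 0 to 3 and the data are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Made-up WS/48 values standing in for the real career data.
df = pd.DataFrame({'WS/48': rng.normal(loc=0.08, scale=0.05, size=1000)})

# Cut points at the 10th, 50th and 90th percentiles.
cuts = np.percentile(df['WS/48'], [10, 50, 90])

# np.digitize gives 0 below the first cut, 1 between the first and
# second cut, and so on: 0 = poor, 1 = below average,
# 2 = above average, 3 = great.
df['perf_cat'] = np.digitize(df['WS/48'], cuts)

# Fraction of players falling in each category.
proportions = [float(np.mean(df['perf_cat'] == c)) for c in range(4)]
print(proportions)
```

The proportions come out close to 10%, 40%, 40% and 10% by construction.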

[0.096314496314496317, 0.40196560196560199, 0.39950859950859952, 0.10221130221130222]

My goal is to use rookie year performance to classify players into these 4 categories. I have a big matrix with lots of data about rookie year performance, but I grouped players using the GMM because I suspect that players in the different groups have different “paths” to success. I am including the groupings in my classification model and computing interaction terms. The interaction terms will allow rookie performance to produce different predictions for the different groups.
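
A tiny example shows what interaction terms add. The row below is made up: two rookie stats followed by two group-membership probabilities. With interaction_only=True, PolynomialFeatures keeps a constant column and the original columns, and appends every pairwise product, so 4 inputs become 1 + 4 + 6 = 11 columns.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up row: two rookie stats plus two group-membership probabilities.
X_small = np.array([[10.0, 5.0, 0.8, 0.2]])

# Degree 2, products of distinct columns only (no squared terms).
poly_small = PolynomialFeatures(2, interaction_only=True)
X_inter = poly_small.fit_transform(X_small)

print(X_inter.shape)
print(X_inter)
```

The products of a stat with a group probability (for example 10.0 * 0.8) are what let the same stat carry a different weight for different groups.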

By including interaction terms, I include quite a few predictor features. I’ve printed the number of predictor features and the number of predicted players below.

    import numpy as np
    from sklearn import preprocessing

    # df, rookie_df_drop and new_labels come from the earlier loading and GMM steps
    df_drop = df[df['Year'] > 1980]
    for x in np.unique(new_labels):
        Label = 'Category%d' % x
        rookie_df_drop[Label] = df_drop[Label]  # give rookies the groupings produced by the GMM model
    X = rookie_df_drop.to_numpy()  # take data out of dataframe

    # create interaction terms
    poly = preprocessing.PolynomialFeatures(2, interaction_only=True)
    X = poly.fit_transform(X)

    Career_data = df[df['Year'] > 1980]
    Y = Career_data['perf_cat']  # target: each player's performance category
    print(np.shape(X))
    print(np.shape(Y))

Now that I have all the features, it’s time to try to predict which players will be poor, below average, above average, and great. To create these predictions, I will use a logistic regression model.

Because I have so many predictors, correlation among the predictors and over-fitting the data are major concerns. I use regularization and cross-validation to combat these issues.

Specifically, I am using regularization together with 5-fold cross-validation. Within the cross-validation, I am trying to estimate how much regularization is appropriate.

Some important notes. First, I am using “balanced” class weights, which tell the model that it is worse to misclassify the poor and great players than the below average and above average players. I do this because I don’t want the model to completely ignore the less frequent classifications. Second, I use the multinomial option for multi_class because it limits the number of models I have to fit: one multinomial model instead of a separate one-vs-rest model per category.
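
To see what “balanced” means numerically, scikit-learn's compute_class_weight applies the rule n_samples / (n_classes * count) to each class. The labels below are made up to mimic a 10/40/40/10 split.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels with a 10/40/40/10 split across the four categories.
y = np.repeat([0, 1, 2, 3], [10, 40, 40, 10])

# weight = n_samples / (n_classes * count) for each class
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
print(weights)
```

With this split the rare categories get a weight of 100 / (4 * 10) = 2.5 and the common ones 100 / (4 * 40) = 0.625, so an error on a poor or great player costs four times as much as an error on an average one.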

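
A minimal sketch of a fit along these lines, on made-up data (not the original notebook code). The name est matches the model used in the rest of this post. The grid of 10 candidate strengths, the default L2 penalty, and max_iter are assumptions. The multi_class argument is left out because recent scikit-learn versions already fit a multinomial model by default with the default solver.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Made-up data: 4 imbalanced classes, roughly 10/40/40/10 percent.
X, Y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=4,
    n_clusters_per_class=1, weights=[0.1, 0.4, 0.4, 0.1], random_state=0,
)

est = LogisticRegressionCV(
    Cs=10,                    # candidate regularization strengths to try
    cv=5,                     # 5-fold cross-validation picks among them
    class_weight='balanced',  # up-weight the rarer classes
    max_iter=2000,
)
est.fit(X, Y)

print(np.ravel(est.C_))   # selected inverse regularization strength(s)
print(est.score(X, Y))    # accuracy on the training data
```

Note that score here is computed on the same data the model was fit on, so it is an optimistic estimate of accuracy.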

0.738109219025

Okay, the model did pretty well, but let’s look at where the errors are coming from. To visualize the model’s accuracy, I am using a confusion matrix. In a confusion matrix, every item on the diagonal is a correctly classified item. Every item off the diagonal is incorrectly classified. The color bar shows the proportion of players in each true category, so darker blue squares represent cells with more items.

It seems the model is best at predicting poor players and great players. It makes more errors when trying to predict the more average players.

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(Y, est.predict(X))

    def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
        """Draw a row-normalized confusion matrix as a heat map."""
        plt.imshow(cm, interpolation='nearest', cmap=cmap, vmin=0.0, vmax=1.0)
        plt.title(title)
        plt.colorbar()
        tick_marks = np.arange(len(np.unique(df['perf_cat'])))
        plt.xticks(tick_marks, np.unique(df['perf_cat']))
        plt.yticks(tick_marks, np.unique(df['perf_cat']))
        plt.tight_layout()
        plt.ylabel('True label')
        plt.xlabel('Predicted label')

    # normalize each row so cells show the fraction of each true category
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
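
To make the row normalization concrete, here is a small worked example with made-up labels (0 = poor through 3 = great). After dividing each row by its total, the diagonal holds the fraction of each true category that was classified correctly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted categories (0 = poor ... 3 = great).
y_true = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 3])
y_pred = np.array([0, 1, 1, 1, 2, 1, 2, 2, 1, 3])

# Rows are true categories, columns are predicted categories.
cm_small = confusion_matrix(y_true, y_pred)

# Divide each row by its total so every row sums to 1.
cm_small_normalized = cm_small.astype(float) / cm_small.sum(axis=1)[:, np.newaxis]

print(cm_small_normalized)
print(np.diag(cm_small_normalized))
```

Here the diagonal is 0.5, 0.75, about 0.67 and 1.0, one value per true category.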

Let’s look at what the model predicts for this year’s rookies. Below I modified two functions that I wrote for a previous post. The first function finds a particular year’s draft picks. The second function produces predictions for each draft pick.

    from urllib.request import urlopen

    import numpy as np
    import pandas as pd
    from bs4 import BeautifulSoup

    def gather_draftData(Year):
        """Return the first-round picks (number and name) for a given draft year."""
        draft_len = 30

        url = 'http://www.basketball-reference.com/draft/NBA_' + str(Year) + '.html'
        html = urlopen(url)
        soup = BeautifulSoup(html, "lxml")

        # one table row per pick; column positions follow the page layout
        rows = soup.findAll('tbody')[0].findAll('tr')
        draft_num = [rows[i].findAll('td')[0].text for i in range(draft_len)]
        draft_nam = [rows[i].findAll('td')[3].text for i in range(draft_len)]

        draft_df = pd.DataFrame([draft_num, draft_nam]).T
        draft_df.columns = ['Number', 'Name']
        draft_df.index = range(len(draft_df))
        return draft_df

    def player_prediction__regressionModel(PlayerName):
        """Return a player's GMM group memberships and category probabilities."""
        # relies on ScaleModel, reduced_model, g, new_labels, poly and est fitted earlier
        clust_df = pd.read_pickle('nba_bballref_career_stats_2016_Apr_15.pkl')
        clust_df = clust_df[clust_df['Name'] == PlayerName]
        clust_df = clust_df.drop(
            ['Year', 'Name', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '2P', 'FT', 'TRB',
             'PTS', 'ORtg', 'DRtg', 'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%',
             'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM',
             'DBPM', 'BPM', 'VORP'],
            axis=1)
        new_vect = ScaleModel.transform(clust_df.to_numpy().reshape(1, -1))
        reduced_data = reduced_model.transform(new_vect)
        predictions = g.predict_proba(reduced_data)  # soft group memberships
        for x in np.unique(new_labels):
            Label = 'Category%d' % x
            clust_df[Label] = predictions[:, x]

        Predrookie_df = pd.read_pickle('nba_bballref_rookie_stats_2016_Apr_16.pkl')
        Predrookie_df = Predrookie_df[Predrookie_df['Name'] == PlayerName]
        Predrookie_df = Predrookie_df.drop(['Year', 'Career Games', 'Name'], axis=1)
        for x in np.unique(new_labels):
            Label = 'Category%d' % x
            Predrookie_df[Label] = clust_df[Label]  # give rookies the groupings produced by the GMM model
        predX = Predrookie_df.to_numpy()  # take data out of dataframe
        predX = poly.transform(predX)  # same interaction terms as the training data
        predictions2 = est.predict_proba(predX)
        return {'Name': PlayerName, 'Group': predictions, 'Prediction': predictions2[0]}

Below I create a plot depicting the model’s predictions. On the y-axis are the four classifications. On the x-axis are the players from the 2015 draft. Each cell in the plot is the probability of a player belonging to one of the classifications. Again, a darker blue cell means a higher probability. Good news for us T-Wolves fans! The model loves KAT.

    draft_df = gather_draftData(2015)

    draft_df.loc[14, 'Name'] = 'Kelly Oubre Jr.'  # annoying name inconsistencies

    plt.subplots(figsize=(14, 6))

    draft_df = draft_df.drop(25, axis=0)  # Spurs' 1st round pick has not played yet

    # one row of category probabilities per drafted player
    predictions = []
    for name in draft_df['Name']:
        predict_dict = player_prediction__regressionModel(name)
        predictions.append(predict_dict['Prediction'])

    plt.imshow(np.array(predictions).T, interpolation='nearest', cmap=plt.cm.Blues, vmin=0.0, vmax=1.0)
    plt.title('Predicting Future Performance of 2015-16 Rookies')
    plt.colorbar(shrink=0.25)
    plt.xticks(range(len(draft_df)), draft_df['Name'], rotation=90)
    plt.yticks(range(0, 4), ['Poor', 'Below Average', 'Above Average', 'Great'])
    plt.tight_layout()
    plt.ylabel('Prediction')
    plt.xlabel('Draft Position')