mdegis/machine-learning: machine learning examples in python

原作者: [db:作者] 来自: 网络收藏邀请

开源软件名称（OpenSource Name）：

mdegis/machine-learning

开源软件地址(OpenSource Url)：

https://github.com/mdegis/machine-learning

开源编程语言(OpenSource Language)：

Python 100.0%

开源软件介绍(OpenSource Introduction)：

Machine Learning in Python

This repository contains some assignments and exercises of Machine Learning course of Sebastian Thrun and Katie Malone.

Please read the license file. I do NOT take any responsibility in case of plagiarism. It's ONLY educational purpose.

Mini-Projects:

Before try to run any of the assignments, please run 'startup.py' in 'tools' directory. This will automatically install e-mail data set and extract it (which is about 423 MB in tgz, 1.4 GB when decompressed).

Lesson 1: Naive Bayes

Naive Bayes Classifier is used to identify emails by their authors.
Also time performance and accuracy are calculated.
Decide whether car should go fast or slow depending on bumpiness level of road.

Lesson 2: SVM

In this mini-project, we’ll tackle the exact same email author ID problem as the Naive Bayes mini-project, but now with an SVM. What we find will help clarify some of the practical differences between the two algorithms. This project also gives us a chance to play around with parameters a lot more than Naive Bayes did, so we will do that too.

Read the comments in the code, for more information.

Lesson 3: Decision Tree

In this project we'll be tackling the same project that we've done with our last two supervised classification algorithms. We're trying to understand who wrote an email based on the word content of that email. This time we'll be using a decision tree. We'll also dig into the features that we use a little bit more. This'll be a dedicated topic in the latter part of the class. What features give you the most effective, the most accurate supervised classification algorithm?

Read the comments in the code, for more information.

Overfitted example: Fixed:

Lesson 4: AdaBoost (Adaptive Boosting), kNN and Random Forrest

AdaBoost:

While every learning algorithm will tend to suit some problem types better than others, and will typically have many different parameters and configurations to be adjusted before achieving optimal performance on a dataset, AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree growing algorithm such that later trees tend to focus on harder to classify examples.

A great article about AdaBoost can be found at https://www.cs.princeton.edu/~schapire/papers/explaining-adaboost.pdf

k Nearest Neighbors:

Neighbors-based classification is a type of instance-based learning or non-generalizing learning; it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.

Random Forrest:

The Random Forrest (ensemble learning method [like AdaBoost] for classification, regression) method combines Breiman's "bagging" idea and the random selection of features, introduced independently by Ho and Amit and Geman in order to construct a collection of decision trees with controlled variance. The selection of a random subset of features is an example of the random subspace method, which, in Ho's formulation, is a way to implement classification proposed by Eugene Kleinberg.

Lesson 5: Dataset and Questions

The Enron fraud is a big, messy and totally fascinating story about corporate malfeasance of nearly every imaginable type. The Enron email and financial datasets are also big, messy treasure troves of information, which become much more useful once you know your way around them a bit. We’ve combined the email and finance data into a single dataset, which you’ll explore in this mini-project.

The aggregated Enron email + financial dataset is stored in a dictionary, where each key in the dictionary is a person’s name and the value is a dictionary containing all the features of that person. The email + finance (E+F) data dictionary is stored as a pickle file, which is a handy way to store and load python objects directly.

Lesson 6: Regression

In this project, we will use regression to predict financial data for Enron employees and associates. Once we know some financial data about an employee, like their salary, what would you predict for the size of their bonus?

Read the comments in the code, for more information.

Lesson 7: Outliers

Having large outliers can have a big effect on your regression result. So in the first part of this mini project, you're going to implement the algorithm that is you take the 10% or so of data points that have the largest residuals, relative to your regression. You remove them, and then you refit the regression, and you see how the result changes.

The second thing we'll do is take a closer at the Enron data. This time with a particular eye towards outliers. You'll find very quickly that there are some data points that fall far outside of the general pattern.

Lesson 8: Unsupervised Learning (K-Means Clustering)

In this project, we'll apply k-means clustering to our Enron financial data. Our final goal, of course, is to identify persons of interest; since we have labeled data, this is not a question that particularly calls for an unsupervised approach like k-means clustering.

Nonetheless, you'll get some hands-on practice with k-means in this project, and play around with feature scaling, which will give you a sneak preview of the next lesson's material.

Great online tool to visualize k-Means Cluster algorithm can be founded at http://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Lesson 9: Feature Scaling

In the mini-project on K-means clustering, you clustered the data points. And then at the end, we sort of gestured towards feature scaling as something that could change the output of that clustering algorithm. In this mini-project, you'll actually deploy the feature scaling yourself. So you'll take the code from the K-means clustering algorithm and add in the feature scaling and then in doing so, you'll be recreating the steps that we took to make those new clusters.

salary = []
for i in data_dict:
    if (data_dict[i][feature_1]=='NaN'):
        salary.append(0.0)
        # pass
    else:    
        salary.append(float(data_dict[i][feature_1]))
ma= max(salary)        
mi=min(salary)
print "salary maximum: ", ma, " minimum: ", mi
# maximum:  1111258.0  minimum:  477.0 comment out line 121 to get rid of zeroes.
print float(200000-mi)/(ma-mi)

salary_ = numpy.array(salary)
salary_ = salary_.astype(float)
scaler = MinMaxScaler()
rescaled_salary = scaler.fit_transform(salary_)
print rescaled_salary

Lesson 10: Text Learning

In the beginning of this class, you identified emails by their authors using a number of supervised classification algorithms. In those projects, we handled the preprocessing for you, transforming the input emails into a TfIdf so they could be fed into the algorithms. Now you will construct your own version of that preprocessing step, so that you are going directly from raw data to processed features.

You will be given two text files: one contains the locations of all the emails from Sara, the other has emails from Chris. You will also have access to the parseOutText() function, which accepts an opened email as an argument and returns a string containing all the (stemmed) words in the email.

Read the comments in the code, for more information ([This code](/010 - Text Learning/vectorize_text.py) contains some cool examples).

Lesson 11: Feature Selection

In one of the earlier videos in this lesson we told you about that there was a word that was effectively serving as a signature on the e-mails and we didn't initially realize it. Now, the mark of a good machine learner doesn't mean that they never make any mistakes or that their features are always perfect. It means that they're on the lookout for ways to check this and to figure out if there is a bug in there that they need to go in and fix.

So in this case it would mean that there's a type of signature word, that we would need to go in and remove in order for us to, to feel like we were being fair in our supervised classification. We will be working on a problem that arose in preparing Chris and Sara’s email for the author identification project; it had to do with a feature that was a little too powerful (effectively acting like a signature, which gives an arguably unfair advantage to an algorithm). You’ll work through that discovery process here.

Read the comments in the [code](/011 - Feature Selection/find_signature.py), for more information.

Lesson 12: Principal Component Analysis

PCA example by Eigenfaces (face recognition).

===================================================
Faces recognition example using eigenfaces and SVMs
===================================================

The dataset used in this example is a preprocessed excerpt of the
"Labeled Faces in the Wild", aka LFW_:

  http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)

  .. _LFW: http://vis-www.cs.umass.edu/lfw/

  original source: http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html


2015-08-05 22:25:02,927 Loading LFW people faces from /home/mdegis/scikit_learn_data/lfw_home
Total dataset size:
n_samples: 1288
n_features: 1850
n_classes: 7
Extracting the top 150 eigenfaces from 966 faces
done in 4.846s
Projecting the input data on the eigenfaces orthonormal basis
done in 0.372s
Fitting the classifier to the training set
done in 18.366s
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='auto', coef0=0.0, degree=3,
  gamma=0.005, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)
Predicting the people names on the testing set
done in 0.061s
                   precision    recall  f1-score   support

     Ariel Sharon       1.00      0.62      0.76        13
     Colin Powell       0.78      0.82      0.80        60
  Donald Rumsfeld       0.87      0.74      0.80        27
    George W Bush       0.82      0.92      0.87       146
Gerhard Schroeder       0.90      0.76      0.83        25
      Hugo Chavez       1.00      0.53      0.70        15
       Tony Blair       0.77      0.75      0.76        36

      avg / total       0.83      0.83      0.82       322

[[  8   2   0   3   0   0   0]
 [  0  49   2   6   0   0   3]
 [  0   0  20   5   0   0   2]
 [  0  11   0 135   0   0   0]
 [  0   0   0   4  19   0   2]
 [  0   0   0   5   1   8   1]
 [  0   1   1   6   1   0  27]]

013 - Validation

data_dict = pickle.load(open("final_project_dataset.pkl", "r"))

# first element is our labels, any added elements are predictor
# features. Keep this the same for the mini-project, but you'll
# have a different feature list when you do the final project.
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)

# without spliting data part:
clf = DecisionTreeClassifier()
# clf.fit(features,labels)
# print clf.score(features, labels) # 0.989473684211

features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(
	features, labels, test_size=0.3, random_state=42)
clf.fit(features_train, labels_train)
print clf.score(features_test, labels_test) # 0.724137931034

Lesson 14: Evaluation Metrics

In the last lesson, you created your first person of interest identifier algorithm, and you set up a training and testing framework so that you could assess the accuracy of that model.

Now that you know much more about evaluation metrics, we're going to have you deploy those evaluation metrics on the framework that you've set up in the last lesson. And so, by the time you get to the final project, the main thing that you'll have to think about will be the features that you want to use, the algorithm that you want to use, and any parameter tunes. You'll already have the training and testing, and the evaluation matrix all set up.

鲜花

握手

雷人

路过

鸡蛋

该文章已有0人参与评论

请发表评论

全部评论

专题导读

More+

10-27 六六分期app的软件客服如何联系？(六六分期

11-06 可心卡盟:win10系统火狐flash插件崩溃怎么

11-06 亲亲特价:怎么删除回收站图标

11-06 济南大学虚拟社区:鲁大师节能降温的具体办

11-06 xlueops.exe:无线网络安装向导

11-06 女斗合众国:win7系统cf与主机连接不稳定怎

11-06 0xc000022-[cf烟雾头]cf怎么调烟雾头

11-06 qizideyouhuo:应用程序无法正常启动0xc0000

11-06 ipz-185:win7系统vcf文件怎么打开

11-06 傻哥蹦迪:win10系统s4怎么打开usb调试

11-06 八神浩树gtaste:回收站清空了怎么恢复

11-06 妖尾之黑色守护:win10系统电脑没有1440x900

11-06 校园至尊魔王小说:win7系统浏览网页时字体

11-06 女斗合众国:win10系统访问共享文件夹提示请

11-06 tokyo hot n0654:恢复win7系统默认字体一招

11-06 雨酷仙境:设置win7系统转移临时文件夹腾出

11-06 阿穆纳伊之杖:win7系统开始菜单在右边还原

11-06 tunespotting:win10系统火狐flash插件总是

11-06 甘尔葛分析师：计谋网站seo关键词暴涨有什

11-06 蔡贵霖: 计谋网站seo关键词暴涨有什么秘密

11-06 博益网首页:ao3网页版进入不了解决方法

11-06 漏斗子专栏: 网站数据分析小白易懂精华篇

11-06 见证双虹怎么做:win7系统开启telnet命令的

11-06 颾狐蝶蜋:系统资源不足无法完成请求的服务

11-06 国光中学校歌:提交网站到alexa查询详细步骤

11-06 西安有情天:静态网页和动态网页的区别

11-06 红木雅尚斋:外部链接构造对网站的好处

11-06 前官礼遇：防止域名劫持–增强域安全性的10

11-06 密传二转答案: 中文分词算法有哪些

11-06 金泉家园邮编:百度快照劫持的表现及应对方

ashishpatel26/Last-Minute-Notes-of-Machine-learning-and-Deep-learning: Last Minu ...发布时间：2022-08-19

rohan-paul/MachineLearning-DeepLearning-Code-for-my-YouTube-Channel: The full co ...发布时间：2022-08-19

剪的笔顺,诠释剪的笔画,认识剪的部首

1 六六分期app的软件客服如何联系？(六六分期

六六分期app的软件客服如何联系？不知道吗？加qq群【895510560】即可！标题：六六分期

阅读：19691|2023-10-27

2 可心卡盟:win10系统火狐flash插件崩溃怎么

今天小编告诉大家如何处理win10系统火狐flash插件总是崩溃的问题，可能很多用户都不知

阅读：10121|2022-11-06

3 亲亲特价:怎么删除回收站图标

今天小编告诉大家如何对win10系统删除桌面回收站图标进行设置，可能很多用户都不知道

阅读：8412|2022-11-06

4 济南大学虚拟社区:鲁大师节能降温的具体办

今天小编告诉大家如何对win10系统电脑设置节能降温的设置方法，想必大家都遇到过需要

阅读：8769|2022-11-06

5 xlueops.exe:无线网络安装向导

我们在使用xp系统的过程中,经常需要对xp系统无线网络安装向导设置进行设置，可能很多

阅读：8724|2022-11-06

6 女斗合众国:win7系统cf与主机连接不稳定怎

今天小编告诉大家如何处理win7系统玩cf老是与主机连接不稳定的问题，可能很多用户都不

阅读：9783|2022-11-06

7 0xc000022-[cf烟雾头]cf怎么调烟雾头

电脑对日常生活的重要性小编就不多说了，可是一旦碰到win7系统设置cf烟雾头的问题，很

阅读：8710|2022-11-06

8 qizideyouhuo:应用程序无法正常启动0xc0000

我们在日常使用电脑的时候，有的小伙伴们可能在打开应用的时候会遇见提示应用程序无法

阅读：8076|2022-11-06

9 ipz-185:win7系统vcf文件怎么打开

今天小编告诉大家如何对win7系统打开vcf文件进行设置，可能很多用户都不知道怎么对win

阅读：8758|2022-11-06

10 傻哥蹦迪:win10系统s4怎么打开usb调试

今天小编告诉大家如何对win10系统s4开启USB调试模式进行设置，可能很多用户都不知道怎

阅读：7603|2022-11-06

客服电话

电子邮件

mdegis/machine-learning: machine learning examples in python

开源软件名称（OpenSource Name）：

开源软件地址(OpenSource Url)：

开源编程语言(OpenSource Language)：

开源软件介绍(OpenSource Introduction)：

Machine Learning in Python

Mini-Projects:

Lesson 1: Naive Bayes

Lesson 2: SVM

Lesson 3: Decision Tree

Lesson 4: AdaBoost (Adaptive Boosting), kNN and Random Forrest

Lesson 5: Dataset and Questions

Lesson 6: Regression

Lesson 7: Outliers

Lesson 8: Unsupervised Learning (K-Means Clustering)

Lesson 9: Feature Scaling

Lesson 10: Text Learning

Lesson 11: Feature Selection

Lesson 12: Principal Component Analysis

013 - Validation

Lesson 14: Evaluation Metrics

请发表评论

全部评论

上一篇：

下一篇：

CVE-2022-34600

公分和厘米一样吗？

dphi-official/Machine_Learning_Bootcamp

juven/maven-bash-completion: Maven Bash

win7系统注册表编辑器打开的操作方法

剪的笔顺,诠释剪的笔画,认识剪的部首

六六分期app的软件客服如何联系？(六六分期

florent37/ViewAnimator: A fluent Android

florent37/Shrine-MaterialDesign2: implem

CVE-2020-36276

SimpleSoftwareIO/simple-sms: Send and re

关于我们

产品与服务

解决方案

139-2527-9053