开源软件名称(OpenSource Name):GeorgeSeif/Python-Machine-Learning开源软件地址(OpenSource Url):https://github.com/GeorgeSeif/Python-Machine-Learning开源编程语言(OpenSource Language):Python 100.0%开源软件介绍(OpenSource Introduction):Python Machine LearningDescriptionA set of machine learing algorithms implemented in Python 3.5. Please also see my related repository for Python Data Science which contains various data science scripts for data analysis and visualisation. Programs can one of three implementations:
The included programs are:
InformationHere are the descriptions of the above machine learning algorithms Linear and Logistic RegressionRegression is a technique used to model and analyze the relationships between variables and often times how they contribute and are related to producing a particular outcome together. Beginning with the simple case, Single Variable Linear Regression is a technique used to model the relationship between a single input independant variable (feature variable) and an output dependant variable using a linear model i.e a line. The more general case is Multi Variable Linear Regression where a model is created for the relationship between multiple independant input variables (feature variables) and an output dependant variable. The model remains linear in that the output is a linear combination of the input variables. There is a third most general case called Polynomial Regression where the model now becomes a non-linear combination of the feature variables; this however requires knowledge of how the data relates to the output. For all of these cases of the regression the output variable is a real-number (rather than a class category). We can also do logistic regression where instead of predicting a real-number, we predict the class or group that the input variable represent. This can be done by modifying the regression training such that the error is computed as the probability that the current example belongs to a particular class. This can be done simply by taking the sigmoid of the regular linear regression result and using a one vs. all scheme, or simple applying the Softmax function. Regression models can be trained using Stochastic Gradient Descent (SGD). Regression is fast to model and is particularly useful when the relationship to be modelled is not extremely complex (if it is complex, better off using something like a neural network). Neural NetworksA Neural Network consists of an interconnected group of nodes called neurons. The input feature variables from the data are passed to these neurons as a multi-variable linear combination, where the values multiplied by each feature variable are known as weights. A non-linearity is then applied to this linear combination which gives the neural network the ability to model complex non-linear relationships. A neural network can have multiple layers where the output of one layer is passed to the next one in the same way. At the output, there is generally no non-linearity applied. Neural Networks are trained using Stochastic Gradient Descent (SGD) and the backpropagation algorithm. Neural networks can be used for either classification or regression in the same way as the linear and logistic regressions' description above. Since neural networks can have many layers with non-linearities, they are most suitable for modelling complex non-linear relationships in which taking the time to train them properly is worthwhile. Naive BayesNaive Bayes is a classification technique based on Bayes' Theorem with the assumption that all feature variables are independant. Thus a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Naive Bayes classifier combines 3 terms to compute the probability of a class: the class probability in the dataset, multiplied by the probability of the example feature variables occuring given the current class, divided by the probability of those particular example feature variables occuring in general. To compute the probability of particular feature variables occuring there are 3 main optional techniques. One can assume that the value of a particular variable is Gaussian distributed which can be a common case and thus this method is useful when the variables are real numbers. Multinomial division is good for feature variables that are categorical as it computes the probability based on histogram bins. The final option is to use a Bernouli probability model when the data is binary. Naive Bayes is simplistic and easy to use yet can outperform other more complex classification algorithms. Is has fast computation and thus is well suited for application on large datasets. Support Vector MachinesSupport Vector Machines (SVMs) are a learning algorithm mainly used for classification where the learned model is a set of hyperplanes that seperate the data into the respective classes. The hyperplanes are selected by learning the parameters that maximize the distance from the support vectors to the hyperplane, where the support vectors are the closest vectors to the plane. If hyper-plane having low margin is selected, then there is high chance of miss-classification. If the data is not linearly seperable in the current dimensional space, a kernel function can be used to transform the data into a higher dimensional space where it is seperable via a linear hyperplance. SVMs have the strong advantage that they are highly effective in high dimensional feature space and at modelling complex relationships due to the fact that their kernal functions can be specified. Specifying a particular kernel function may require knowledge of the data, however there are some common default ones that often work well in practice. Additionally, the training of the SVMs only uses a very small subset of the data i.e the support vectors and is thus very memory efficient. The disadvantage to SVMs is there long training time as effective training involves complex quadratic programming algorithms. Decision Trees and Random ForestsBeginning with the base case, a Decision Tree is an intuitive model where by one traverses down the branches of the tree and selects the next branch to go down based on a decision at a node. The model- (or tree-) building aspect of decision tree classification algorithms are composed of 2 main tasks: tree induction and tree pruning. Tree induction is the task of taking a set of pre-classified instances as input, deciding which attributes are best to split on, splitting the dataset, and recursing on the resulting split datasets until all training instances are categorized. While building the tree, the goal is to split on the attributes which create the purest child nodes possible, which would keep to a minimum the number of splits that would need to be made in order to classify all instances in our dataset. This purity is generally measured by one of a number of different attribute selection measures. Purity is measured by the concept of information gain, which relates to how much would need to be known about a previously-unseen instance in order for it to be properly classified. In practice, this is measured by comparing entropy, or the amount of information needed to classify a single instance of a current dataset partition, to the amount of information to classify a single instance if the current dataset partition were to be further partitioned on a given attribute. Because of the nature of training decision trees they can be prone to major overfitting. A completed decision tree model can be overly-complex, contain unnecessary structure, and be difficult to interpret. Tree pruning is the process of removing the unnecessary structure from a decision tree in order to make it more efficient, more easily-readable for humans, and more accurate as well. Tree pruning is generally done by compressing part of the tree into less strict and rigid decision boundaries into ones that are more smooth and generalize better. Scikit learn comes with a biult in tool to visualize the full decision tree, including the specific decision boundaries set by each node. Random Forests are simply an ensemble of decision trees. The input vector is run through multiple decision trees. For regression, the output value of all the trees is averaged; for classification a voting scheme is used to determine the final class. ClusteringClustereing is an unsupervised learning technique which involves grouping the data points based on their intrinsic feature variable similarity. A number of clustering algorithms are outlined below:
Scikit Learn Machine Learning Cheat SheetHelpersIn addition the the main algorithm files, we have the following set of helper functions in the "ml_helpers.py" file:
Requirements
InstallationThe above packages can be installed by running the commands listed in the "install.txt" file |
2023-10-27
2022-08-15
2022-08-17
2022-09-23
2022-08-13
请发表评论