Open-source project name: justmarkham/pycon-2016-tutorial
Open-source project URL: https://github.com/justmarkham/pycon-2016-tutorial
Open-source project language: Jupyter Notebook 82.8%
Open-source project introduction:

Tutorial: Machine Learning with Text in scikit-learn
Presented by Kevin Markham at PyCon on May 28, 2016. Watch the complete tutorial video on YouTube.

Description
Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we'll build and evaluate predictive models from real-world text using scikit-learn.

Objectives
By the end of this tutorial, attendees will be able to confidently build a predictive model from their own text-based data, including feature extraction, model building and model evaluation.

Required Software
Attendees will need to bring a laptop with scikit-learn and pandas (and their dependencies) already installed. Installing the Anaconda distribution of Python is an easy way to accomplish this. Both Python 2 and 3 are welcome.

I will be leading the tutorial using the IPython/Jupyter notebook, and have added a pre-written notebook to this repository. I have also created a Python script that is identical to the notebook, which you can use in the Python environment of your choice.

Tutorial Files

Prerequisite Knowledge
Attendees of this tutorial should be comfortable working in Python, should understand the basic principles of machine learning, and should have at least basic experience with both pandas and scikit-learn. However, no knowledge of advanced mathematics is required.

Abstract
It can be difficult to figure out how to work with text in scikit-learn, even if you're already comfortable with the scikit-learn API. Many questions immediately come up: Which vectorizer should I use, and why? What's the difference between a "fit" and a "transform"? What's a document-term matrix, and why is it so sparse? Is it okay for my training data to have more features than observations? What's the appropriate machine learning model to use? And so on...

In this tutorial, we'll answer all of those questions, and more! We'll start by walking through the vectorization process in order to understand the input and output formats. Then we'll read a simple dataset into pandas, and immediately apply what we've learned about vectorization. We'll move on to the model building process, including a discussion of which model is most appropriate for the task. We'll evaluate our model a few different ways, and then examine the model for greater insight into how the text is influencing its predictions. Finally, we'll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance.

Detailed Outline
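As a rough illustration of two questions raised in the abstract (the difference between a "fit" and a "transform", and why a document-term matrix is sparse), here is a minimal pure-Python sketch. It is not scikit-learn's code and not taken from the tutorial notebook: the class name `TinyCountVectorizer`, its tokenization rule (lowercased tokens of two or more word characters), and the example messages are all assumptions made up for illustration. In scikit-learn itself, this role is played by `CountVectorizer`.

```python
import re
from collections import Counter


class TinyCountVectorizer:
    """Toy bag-of-words vectorizer (hypothetical; not scikit-learn's implementation)."""

    def fit(self, documents):
        # "fit" only learns the vocabulary: one column index per distinct token.
        tokens = sorted({tok for doc in documents for tok in self._tokenize(doc)})
        self.vocabulary_ = {tok: idx for idx, tok in enumerate(tokens)}
        return self

    def transform(self, documents):
        # "transform" uses the already-learned vocabulary to build the
        # document-term matrix: one row per document, one column per token.
        # Tokens that were not seen during fit are dropped.
        rows = []
        for doc in documents:
            counts = Counter(self._tokenize(doc))
            rows.append([counts.get(tok, 0) for tok in self.vocabulary_])
        return rows

    @staticmethod
    def _tokenize(text):
        # Assumed rule: lowercase, keep runs of two or more word characters.
        return re.findall(r"\b\w\w+\b", text.lower())


# Made-up training and testing messages.
train_docs = ["call you tonight", "Call me a cab", "please call me PLEASE"]
test_docs = ["please don't call me"]

vect = TinyCountVectorizer().fit(train_docs)   # learn the vocabulary from training text only
train_dtm = vect.transform(train_docs)         # 3 documents x 6 tokens
test_dtm = vect.transform(test_docs)           # same 6 columns, even for unseen text

print(list(vect.vocabulary_))  # ['cab', 'call', 'me', 'please', 'tonight', 'you']
print(train_dtm)               # [[0, 1, 0, 0, 1, 1], [1, 1, 1, 0, 0, 0], [0, 1, 1, 2, 0, 0]]
print(test_dtm)                # [[0, 1, 1, 1, 0, 0]]

# Fraction of zero entries in the training matrix.
zeros = sum(row.count(0) for row in train_dtm)
sparsity = zeros / (len(train_dtm) * len(vect.vocabulary_))
print(sparsity)                # 0.5
```

The token "don" gets no column in the test matrix because the vocabulary is fixed at fit time, which is why fitting is done on training text and only transforming on new text. Even in this tiny example half of the entries are zero; with a realistic vocabulary, each document uses only a small share of the columns, so a dense list-of-lists like this one would usually be replaced by a sparse matrix format.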

About the Instructor
Kevin Markham is the founder of Data School and the former lead instructor for General Assembly's Data Science course in Washington, DC. He is passionate about teaching data science to people who are new to the field, regardless of their educational and professional backgrounds, and he enjoys teaching both online and in the classroom. Kevin's professional focus is supervised machine learning, which led him to create the popular scikit-learn video series for Kaggle. He has a degree in Computer Engineering from Vanderbilt University.

Recommended Resources
Text classification:
Naive Bayes and logistic regression:
scikit-learn:
pandas: