Natural Language 27: Converting Words to Features with NLTK
Published: 2019-06-18


 

(Video tutorial recorded by the blog author)

https://www.pythonprogramming.net/words-as-features-nltk-tutorial/

Converting words to Features with NLTK

In this tutorial, we're going to be building off the previous video and compiling feature lists of words from positive reviews and words from the negative reviews to hopefully see trends in specific types of words in positive or negative reviews.

To start, our code:

import nltk
import random
from nltk.corpus import movie_reviews

# Pair each review's word list with its category ("pos" or "neg")
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

# Lowercase every word in the corpus
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

# most_common() sorts by frequency; slicing keys() would not
word_features = [w for (w, count) in all_words.most_common(3000)]

Mostly the same as before, but with a new variable, word_features, which contains the 3,000 most common words. Next, we're going to build a quick function that looks for each of these 3,000 words in a document and marks its presence as True or False:
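To see why it matters whether the vocabulary comes from keys() or from most_common(), here is a minimal sketch on a toy token list. It assumes a plain collections.Counter as a stand-in for nltk.FreqDist (which subclasses Counter in NLTK 3), and the tokens are made up for illustration rather than taken from the corpus.

```python
from collections import Counter

# Toy token list standing in for movie_reviews.words(); not real corpus data.
tokens = ["the", "plot", "was", "great", "the",
          "cast", "was", "great", "the", "end"]

# Counter stands in for nltk.FreqDist here (an assumption for this sketch).
freq = Counter(w.lower() for w in tokens)

# First three keys: order of first appearance (Python 3.7+), not frequency.
first_keys = list(freq.keys())[:3]

# Three most frequent words, highest count first.
top_words = [w for w, _count in freq.most_common(3)]

print(first_keys)
print(top_words)
```

The first list includes "plot", which appears only once, while the second contains only words that occur at least twice.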

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features
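As a quick illustration of what this function returns, here is a small sketch with a hypothetical three-word vocabulary and a made-up review. The real word_features list has 3,000 entries, so treat this only as a scaled-down picture.

```python
# Hypothetical three-word vocabulary; the tutorial uses 3,000 words.
word_features = ["great", "boring", "plot"]


def find_features(document):
    # A set makes each membership test fast and ignores repeated words.
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features


# Made-up review tokens, for illustration only.
review = ["the", "plot", "was", "great", "great", "fun"]
features = find_features(review)
print(features)
```

Note that "great" appears twice in the review but is still just marked True: the features record presence, not counts.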

Next, we can print one feature set like:

print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
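With 3,000 entries, the printed dictionary is mostly False values and hard to read. One way to inspect it is to keep only the words that were found. The sketch below assumes a small hand-written dictionary shaped like the output of find_features, not real corpus output.

```python
# Hypothetical feature dict shaped like the output of find_features.
features = {"great": True, "boring": False, "plot": True, "awful": False}

# Keep only the vocabulary words that actually occur in the document.
present = sorted(w for w, found in features.items() if found)

print(len(present), "of", len(features), "feature words present")
print(present)
```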

Then we can do this for all of our documents, saving the feature existence booleans and their respective positive or negative categories by doing:

featuresets = [(find_features(rev), category) for (rev, category) in documents]
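To make the shape of featuresets concrete, here is a minimal sketch on two made-up documents with a hypothetical three-word vocabulary. Each element comes out as a (feature dictionary, category) tuple, which is the form described above.

```python
# Hypothetical vocabulary and documents, for illustration only.
word_features = ["great", "boring", "plot"]


def find_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}


documents = [
    (["a", "great", "plot"], "pos"),
    (["boring", "plot"], "neg"),
]

# Same comprehension as in the tutorial, applied to the toy documents.
featuresets = [(find_features(rev), category) for (rev, category) in documents]

for features, category in featuresets:
    print(category, features)
```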

Awesome, now that we have our features and labels, what is next? Typically the next step is to go ahead and train an algorithm, then test it. So, let's go ahead and do that, starting with the Naive Bayes classifier in the next tutorial!

 

 

 

 

# -*- coding: utf-8 -*-
"""
Created on Sun Dec  4 09:27:48 2016

@author: daxiong
"""
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

# dict_allWords is a dictionary-like object holding the frequency distribution of all words
dict_allWords = nltk.FreqDist(all_words)
# keys() lists all words; [:3000] takes the first three thousand of them.
# Note: keys() is not sorted by frequency, so these are not necessarily the
# most common words; dict_allWords.most_common(3000) would return those.
word_features = list(dict_allWords.keys())[:3000]
'''
 'combating',
 'mouthing',
 'markings',
 'directon',
 'ppk',
 'vanishing',
 'victories',
 'huddleston',
 ...]
'''

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

words = movie_reviews.words('neg/cv000_29416.txt')
'''
Out[78]: ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

type(words)
Out[65]: nltk.corpus.reader.util.StreamBackedCorpusView
'''
# Remove duplicates; words1 is a set
words1 = set(words)
'''
words1
{'!', '"', '&', "'", '(', ')',
 .......
 'witch', 'with', 'world', 'would', 'wrapped', 'write', 'years', 'you', 'your'}
'''
features = {}
# The word 'victories' is not in words1, so this evaluates to False
('victories' in words1)
'''
Out[73]: False
'''
features['victories'] = ('victories' in words1)
'''
features
Out[75]: {'victories': False}
'''
print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
'''
'schwarz': False,
'supervisors': False,
'geyser': False,
'site': False,
'fevered': False,
'acknowledged': False,
'ronald': False,
'wroth': False,
'degredation': False,
...}
'''
featuresets = [(find_features(rev), category) for (rev, category) in documents]

 

 

featuresets holds 2,000 entries, one per file. Each entry is a tuple containing a feature dictionary (for example, "glory": False) and the file's neg/pos category.

 

 

 

 

Reposted from: http://bhcbx.baihongyu.com/
