Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


validation - Extract city names from text using Python

I have a dataset where the title of one column is "What is your location and time zone?"

This has meant that we have entries like

  1. Denmark, CET
  2. Location is Devon, England, GMT time zone
  3. Australia. Australian Eastern Standard Time. +10h UTC.

and even

  1. My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
  2. For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)

Is there any way to extract the city, country and time zone from this?

I was thinking of creating an array (from an open-source dataset) with all the country names (including short forms) and also city names and time zones. If any word in the dataset matches a city, country, time zone or short form, it would be filled into a new column in the same dataset and counted.

Is this practical?

=========== REPLY BASED ON NLTK ANSWER ============

Running the same code as alecxe, I get:

Traceback (most recent call last):
  File "E:\SBTF\ntlk_test.py", line 19, in <module>
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\__init__.py", line 110, in pos_tag
    tagger = PerceptronTagger()
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 141, in __init__
    self.load(AP_MODEL_LOC)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 209, in load
    self.model.weights, self.tagdict, self.classes = load(loc)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 924, in _open
    return urlopen(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 454, in _open
    'unknown_open', req)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 1265, in unknown_open
    raise URLError('unknown url type: %s' % type)
URLError: <urlopen error unknown url type: c>
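
The last line of the traceback suggests that a Windows path was handed to urlopen as if it were a URL, so the drive letter was read as the URL scheme. The snippet below is only a small illustration of that effect, not a confirmed fix for this NLTK install. It uses Python 3's urllib.parse rather than the Python 2.7 urllib2 shown in the traceback, and the path is a hypothetical stand-in for the model location.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse

# Hypothetical Windows path (an assumption, not taken from the traceback).
path = r"C:\nltk_data\taggers\model.pickle"

# Everything before the first colon is parsed as the URL scheme,
# so the drive letter ends up as the "url type".
scheme = urlparse(path).scheme
print(scheme)

# A file URL states the scheme explicitly instead.
file_url = PureWindowsPath(path).as_uri()
print(file_url)
```

Whether passing a file URL is the right remedy here depends on how NLTK builds the resource location, which the traceback alone does not show.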

1 Reply


I would use what Natural Language Processing and nltk have to offer to extract entities.

The example (heavily based on this gist) tokenizes each line of a file, tags and chunks it, and recursively collects every chunk labelled NE (named entity). More explanation here:

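import nltk

# Requires the NLTK tokenizer, POS tagger and NE chunker data,
# which can be installed with nltk.download().

def extract_entity_names(t):
    """Recursively collect the text of every subtree labelled 'NE'."""
    entity_names = []

    # Only Tree nodes have a label; leaves are plain (word, tag) tuples.
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            # Join the words of the chunk, dropping the POS tags.
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

with open('sample.txt', 'r') as f:
    for line in f:
        # Sentence split -> word tokens -> POS tags -> NE chunks.
        sentences = nltk.sent_tokenize(line)
        tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
        tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
        # binary=True labels every entity 'NE' instead of PERSON, GPE, etc.
        chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

        entities = []
        for tree in chunked_sentences:
            entities.extend(extract_entity_names(tree))

        print(entities)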

For the sample.txt containing:

Denmark, CET
Location is Devon, England, GMT time zone
Australia. Australian Eastern Standard Time. +10h UTC.
My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)

It prints:

['Denmark', 'CET']
['Location', 'Devon', 'England', 'GMT']
['Australia', 'Australian Eastern Standard Time']
['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific']
['London', 'United Kingdom', 'Norway', 'Israel', 'London', 'United Kingdom', 'Boston', 'United States', 'EDT']

The output is not ideal, but might be a good start for you.
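
The lookup-table idea from the question can be applied as a post-filter on the entity lists printed above. The following is a minimal sketch under stated assumptions: COUNTRIES and TIME_ZONES are hypothetical, deliberately tiny sets (a real run would load full lists from an open dataset), and classify_entities, find_offsets and OFFSET_RE are illustrative names that are not part of nltk.

```python
import re

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
COUNTRIES = {"denmark", "england", "australia", "south korea",
             "united kingdom", "norway", "israel", "united states"}
TIME_ZONES = {"cet", "gmt", "edt", "utc",
              "australian eastern standard time"}

# Matches numeric offsets such as "GMT+1" or "+10h UTC",
# which the named-entity chunker does not report.
OFFSET_RE = re.compile(
    r"(?:GMT|UTC)\s*[+-]\d{1,2}|[+-]\d{1,2}h?\s*(?:GMT|UTC)"
)

def classify_entities(entities):
    """Sort entity strings into countries, time zones and everything else."""
    result = {"country": [], "time_zone": [], "other": []}
    for name in entities:
        key = name.lower()
        if key in COUNTRIES:
            result["country"].append(name)
        elif key in TIME_ZONES:
            result["time_zone"].append(name)
        else:
            # Unmatched names are candidates for a city lookup.
            result["other"].append(name)
    return result

def find_offsets(text):
    """Return every UTC/GMT offset mentioned in the raw text."""
    return OFFSET_RE.findall(text)

classified = classify_entities(['Location', 'Devon', 'England', 'GMT'])
offsets = find_offsets("Australia. Australian Eastern Standard Time. +10h UTC.")
print(classified)
print(offsets)
```

Names that land in "other" (such as 'Devon' or 'Location') would still need a city or region list to separate real places from noise, so the result columns should be treated as candidates rather than final values.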

