Part 1: Learning Python (2)

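Python has three built-in data structures: lists, tuples, and dictionaries (using_list.py, using_tuple.py, using_dict.py).
list: a list, e.g. shoplist = ['apple', 'mango', 'carrot', 'banana']
Methods: append to the end with shoplist.append('rice'), sort in place with shoplist.sort(), delete an item with del shoplist[i].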

# Filename: using_list.py
# This is my shopping list
shoplist = ['apple', 'mango', 'carrot', 'banana']
print 'i have', len(shoplist), 'items to choose'
print 'these items are:',
for item in shoplist:
    print item,
print '\nI also have to buy rice'
shoplist.append('rice')
print 'my shoplist now is', shoplist
print 'i will sort my list now'
shoplist.sort()
print 'Sorted shopping list is', shoplist
print 'The first item i will buy is', shoplist[0]
olditem = shoplist[0]
del shoplist[0]
print 'i bought the', olditem
print 'My shopping list is now', shoplist

i have 4 items to choose
these items are: apple mango carrot banana
I also have to buy rice
my shoplist now is ['apple', 'mango', 'carrot', 'banana', 'rice']
i will sort my list now
Sorted shopping list is ['apple', 'banana', 'carrot', 'mango', 'rice']
The first item i will buy is apple
i bought the apple
My shopping list is now ['banana', 'carrot', 'mango', 'rice']

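tuple: a tuple. It behaves like a list but cannot be modified after creation, e.g. zoo = ('wolf', 'elephant', 'penguin') and new_zoo = ('monkey', 'dolphin', zoo)

Tuples in a print statement, one placeholder per tuple element: name = 'Mike'; age = 22; print '%s is %d years old' % (name, age)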

# Filename: using_tuple.py
zoo = ('wolf', 'elephant', 'penguin')
print 'number of animals in the zoo is', len(zoo)
new_zoo = ('monkey', 'dolphin', zoo)
print 'number of animals in the new zoo is', len(new_zoo)
print 'all animals in the new zoo are', new_zoo
print 'Animals brought from old zoo are', new_zoo[2]
print 'last animal brought from old zoo is', new_zoo[2][2]

number of animals in the zoo is 3
number of animals in the new zoo is 3
all animals in the new zoo are ('monkey', 'dolphin', ('wolf', 'elephant', 'penguin'))
Animals brought from old zoo are ('wolf', 'elephant', 'penguin')
last animal brought from old zoo is penguin
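The tuple behaviour above can be checked with a short Python 3 sketch (the listings in these notes use Python 2 print statements). The helper name rejects_assignment is made up for illustration; the format string carries one placeholder per tuple element.

```python
# Python 3 sketch of the tuple operations shown above.
zoo = ('wolf', 'elephant', 'penguin')
new_zoo = ('monkey', 'dolphin', zoo)   # zoo is stored as ONE nested element

# A tuple supplies the values for a format string, one per placeholder.
name, age = 'Mike', 22
message = '%s is %d years old' % (name, age)


def rejects_assignment(t):
    """Return True if item assignment on t raises TypeError."""
    try:
        t[0] = 'lion'
    except TypeError:
        return True
    return False


print(len(new_zoo))             # the nested tuple counts as a single item
print(new_zoo[2][2])            # index into the nested tuple
print(message)
print(rejects_assignment(zoo))  # tuples cannot be changed in place
```

Because a tuple never changes after creation, it is safe to nest one inside another, as new_zoo does with zoo.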

dict: a dictionary. It is made up of key/value pairs; keys must be immutable, while values can be changed. d = {key1 : value1, key2 : value2}

# Filename: using_dict.py
ab = {'Swaroop'   : 'swaroopch@byteofpython.info',
      'Larry'     : 'larry@wall.org',
      'Matsumoto' : 'matz@ruby-lang.org',
      'Spammer'   : 'spammer@hotmail.com'}
print "Swaroop's address is %s" % ab['Swaroop']

# Adding a key/value pair
ab['Guido'] = 'guido@python.org'
print '\nThere are %d contacts in the address-book\n' % len(ab)
for name, address in ab.items():
    print 'Contact %s at %s' % (name, address)
if 'Guido' in ab:  # ab.has_key('Guido')
    print "\nGuido's address is %s" % ab['Guido']

Swaroop's address is swaroopch@byteofpython.info

There are 5 contacts in the address-book

Contact Swaroop at swaroopch@byteofpython.info
Contact Matsumoto at matz@ruby-lang.org
Contact Larry at larry@wall.org
Contact Spammer at spammer@hotmail.com
Contact Guido at guido@python.org

Guido's address is guido@python.org
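A Python 3 sketch of the same dictionary ideas follows. It is an illustration rather than part of the original listing: it uses a shortened address book, and the variable names has_guido, missing, and list_key_rejected are assumptions introduced here.

```python
ab = {
    'Swaroop': 'swaroopch@byteofpython.info',
    'Larry': 'larry@wall.org',
}
ab['Guido'] = 'guido@python.org'   # values can be added or replaced at any time

# Membership test: Python 3 removed has_key(), so `in` is the spelling to use.
has_guido = 'Guido' in ab

# get() returns a default instead of raising KeyError for a missing key.
missing = ab.get('Nobody', 'no address on file')

# Keys must be hashable (strings and numbers are); a list is rejected.
try:
    ab[['a', 'list']] = 'x'
    list_key_rejected = False
except TypeError:
    list_key_rejected = True

for name, address in sorted(ab.items()):   # sorted() fixes the print order
    print('Contact %s at %s' % (name, address))
```

Sorting the items is only for a predictable display; the lookup itself never depends on order.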


# Filename: seq.py
shoplist = ['apple', 'mango', 'carrot', 'banana']
# Indexing or subscription operation
print 'Item 0 is', shoplist[0]
print 'Item 1 is', shoplist[1]
print 'Item 2 is', shoplist[2]
print 'Item 3 is', shoplist[3]
print 'Item -1 is', shoplist[-1]
print 'Item -2 is', shoplist[-2]

# slicing on a list
print 'Item 1 to 3 is', shoplist[1:3]
print 'Item 2 to end is', shoplist[2:]
print 'Item 1 to -1 is', shoplist[1:-1]
print 'Item start to end is', shoplist[:]

# slicing on a string
name = 'swaroop'
print 'characters 1 to 3 is', name[1:3]
print 'characters 2 to end is', name[2:]
print 'characters 1 to -1 is', name[1:-1]
print 'characters start to end is', name[:]

Item 0 is apple
Item 1 is mango
Item 2 is carrot
Item 3 is banana
Item -1 is banana
Item -2 is carrot

Item 1 to 3 is ['mango', 'carrot']
Item 2 to end is ['carrot', 'banana']
Item 1 to -1 is ['mango', 'carrot']
Item start to end is ['apple', 'mango', 'carrot', 'banana']
characters 1 to 3 is wa
characters 2 to end is aroop
characters 1 to -1 is waroo
characters start to end is swaroop
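The indexing and slicing rules behind this output can be stated compactly. The Python 3 sketch below is an illustration only; the names n, same_last, middle, and trimmed are introduced here.

```python
shoplist = ['apple', 'mango', 'carrot', 'banana']
n = len(shoplist)

# A negative index counts from the end: -1 is the same position as n - 1.
same_last = shoplist[-1] == shoplist[n - 1]

# A slice [start:stop] includes start and excludes stop.
middle = shoplist[1:3]     # items at positions 1 and 2
trimmed = shoplist[1:-1]   # same as shoplist[1:n - 1]

# Strings are sequences too, so the same rules apply to them.
name = 'swaroop'
print(same_last, middle, trimmed, name[1:3], name[2:], name[1:-1])
```

Since the stop position is excluded, shoplist[1:3] and shoplist[1:-1] select the same two items in a four-item list.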

mylist = shoplist: a reference (both names refer to the same list object)
mylist = shoplist[:]: a full copy (a new list holding the same items)

# Filename: reference.py
print 'Simple Assignment'
shoplist = ['apple', 'mango', 'carrot', 'banana']
# mylist is just another name pointing to the same object!
mylist = shoplist
del shoplist[0]
print 'shoplist is', shoplist
print 'mylist is', mylist
# make a copy by doing a full slice
mylist = shoplist[:]
del mylist[0]
print 'shoplist is', shoplist
print 'mylist is', mylist

Simple Assignment
shoplist is ['mango', 'carrot', 'banana']
mylist is ['mango', 'carrot', 'banana']
shoplist is ['mango', 'carrot', 'banana']
mylist is ['carrot', 'banana']
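The difference between a second name and a copy can also be checked directly with the is operator. This is a Python 3 sketch; alias and snapshot are names chosen here for illustration.

```python
shoplist = ['apple', 'mango', 'carrot', 'banana']

alias = shoplist        # a second name for the same list object
snapshot = shoplist[:]  # a new list holding the same items (a shallow copy)

del shoplist[0]         # mutate the list through its original name

print(alias is shoplist)     # same object, so the deletion is visible here
print(snapshot is shoplist)  # different object, so it is unaffected
print(alias)
print(snapshot)
```

A full slice copies only the outer list; nested mutable items would still be shared between the two lists.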


# Filename: str_methods.py
name = 'Swaroop'
if name.startswith('Swa'):
    print 'Yes, the string starts with "Swa"'
if 'a' in name:
    print 'Yes, it contains the string "a"'
if name.find('war') != -1:
    print 'Yes, it contains the string "war"'
delimiter = '_*_'
mylist = ['Brazil', 'Russia', 'India', 'China']
print delimiter.join(mylist)

Yes, the string starts with "Swa"
Yes, it contains the string "a"
Yes, it contains the string "war"
Brazil_*_Russia_*_India_*_China

Part 2: Python Natural Language Processing Study Notes (61): 7.2 Chunking

7.2 Chunking

The basic technique we will use for entity detection is chunking, which segments and labels multi-token sequences as illustrated in Figure 7.2. The smaller boxes show the word-level tokenization and part-of-speech tagging, while the large boxes show higher-level chunking. Each of these larger boxes is called a chunk. Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens. Also like tokenization, the pieces produced by a chunker do not overlap in the source text.


  Figure 7.2: Segmentation and Labeling at both the Token and Chunk Levels

In this section, we will explore chunking in some depth, beginning with the definition and representation of chunks. We will see regular expression and n-gram approaches to chunking, and will develop and evaluate chunkers using the CoNLL-2000 chunking corpus. We will then return in Section (5) and Section 7.6 to the tasks of named entity recognition and relation extraction.

Noun Phrase Chunking

We will begin by considering the task of noun phrase chunking, or NP-chunking, where we search for chunks corresponding to individual noun phrases. For example, here is some Wall Street Journal text with NP-chunks marked using brackets:


[ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.

As we can see, NP-chunks are often smaller pieces than complete noun phrases. For example, the market for system-management software for Digital's hardware is a single noun phrase (containing two nested noun phrases), but it is captured in NP-chunks by the simpler chunk the market. One of the motivations for this difference is that NP-chunks are defined so as not to contain other NP-chunks. Consequently, any prepositional phrases or subordinate clauses that modify a nominal will not be included in the corresponding NP-chunk, since they almost certainly contain further noun phrases.

One of the most useful sources of information for NP-chunking is part-of-speech tags. This is one of the motivations for performing part-of-speech tagging in our information extraction system. We demonstrate this approach using an example sentence that has been part-of-speech tagged in Example 7.3. In order to create an NP-chunker, we will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked. In this case, we will define a simple grammar with a single regular-expression rule. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). Using this grammar, we create a chunk parser, and test it on our example sentence. The result is a tree, which we can either print, or display graphically.

>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
... ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print result
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))


Example 7.3 (code_chunkex.py)


 Figure 7.3: Example of a Simple Regular Expression Based NP Chunker.
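To make the idea of a tag pattern concrete, the sketch below matches a pattern such as <DT>?<JJ>*<NN> against a sentence using only Python's re module. It is a simplified illustration under stated assumptions, not a description of how nltk.RegexpParser is implemented: the functions tag_pattern_to_regex and chunk are hypothetical names, chunks are returned as plain lists rather than trees, and only a single rule is supported.

```python
import re


def tag_pattern_to_regex(tag_pattern):
    """Compile a tag pattern such as '<DT>?<JJ>*<NN>' into a regex over '<TAG>' units."""
    def convert(match):
        # Inside a tag, '.' should match any character except the angle
        # brackets, so that a wildcard cannot run across a tag boundary.
        inner = match.group(1).replace('.', '[^<>]')
        return '(?:<(?:%s)>)' % inner
    return re.compile(re.sub(r'<([^<>]+)>', convert, tag_pattern.replace(' ', '')))


def chunk(tagged, tag_pattern):
    """Return tokens in order; each matched run of tokens becomes a list (a chunk)."""
    regex = tag_pattern_to_regex(tag_pattern)
    tag_string = ''.join('<%s>' % tag for _, tag in tagged)

    # Map each character offset where a tag starts to its token index.
    offset_to_index, offset = {}, 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[offset] = index
        offset += len(tag) + 2          # +2 for the surrounding '<' and '>'
    offset_to_index[offset] = len(tagged)

    result, position = [], 0
    for match in regex.finditer(tag_string):
        if match.start() == match.end():
            continue                    # ignore empty matches
        first = offset_to_index[match.start()]
        last = offset_to_index[match.end()]
        result.extend(tagged[position:first])    # tokens outside any chunk
        result.append(list(tagged[first:last]))  # the chunk itself
        position = last
    result.extend(tagged[position:])
    return result


sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
for item in chunk(sentence, '<DT>?<JJ>*<NN>'):
    print(item)
```

Encoding the tags as one string lets an ordinary regular expression do the matching, which is why tag patterns look so much like regular expressions.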

Tag Patterns

The rules that make up a chunk grammar use tag patterns to describe sequences of tagged words. A tag pattern is a sequence of part-of-speech tags delimited using angle brackets, e.g. <DT>?<JJ>*<NN>. Tag patterns are similar to regular expression patterns (Section 3.4). Now, consider the following noun phrases from the Wall Street Journal:


another/DT sharp/JJ dive/NN

trade/NN figures/NNS

any/DT new/JJ policy/NN measures/NNS

earlier/JJR stages/NNS

Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP

We can match these noun phrases using a slight refinement of the first tag pattern above, i.e. <DT>?<JJ.*>*<NN.*>+. This will chunk any sequence of tokens beginning with an optional determiner, followed by zero or more adjectives of any type (including comparative adjectives like earlier/JJR), followed by one or more nouns of any type. However, it is easy to find many more complicated examples which this rule will not cover:

his/PRP$ Mansion/NNP House/NNP speech/NN

the/DT price/NN cutting/VBG

3/CD %/NN to/TO 4/CD %/NN

more/JJR than/IN 10/CD %/NN

the/DT fastest/JJS developing/VBG trends/NNS

's/POS skill/NN


Your Turn: Try to come up with tag patterns to cover these cases. Test them using the graphical interface nltk.app.chunkparser(). Continue to refine your tag patterns with the help of the feedback given by this tool.

Chunking with Regular Expressions

To find the chunk structure for a given sentence, the RegexpParser chunker begins with a flat structure in which no tokens are chunked. The chunking rules are applied in turn, successively updating the chunk structure. Once all of the rules have been invoked, the resulting chunk structure is returned.

Example 7.4 shows a simple chunk grammar consisting of two rules. The first rule matches an optional determiner or possessive pronoun, zero or more adjectives, then a noun. The second rule matches one or more proper nouns. We also define an example sentence to be chunked, and run the chunker on this input.

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and nouns
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

>>> print cp.parse(sentence)
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))

Example 7.4 (code_chunker1.py): Figure 7.4: Simple Noun Phrase Chunker


The $ symbol is a special character in regular expressions, and must be backslash escaped in order to match the tag PP$.

If a tag pattern matches at overlapping locations, the leftmost match takes precedence. For example, if we apply a rule that matches two consecutive nouns to a text containing three consecutive nouns, then only the first two nouns will be chunked:

>>> nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
>>> grammar = "NP: {<NN><NN>} # Chunk two consecutive nouns"
>>> cp = nltk.RegexpParser(grammar)
>>> print cp.parse(nouns)
(S (NP money/NN market/NN) fund/NN)

Once we have created the chunk for money market, we have removed the context that would have permitted fund to be included in a chunk. This issue would have been avoided with a more permissive chunk rule, e.g. NP: {<NN>+}.
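Python's own re module scans left to right for non-overlapping matches, so the same effect can be reproduced on a string of tags. The snippet is offered as an analogy only, not as a claim about RegexpParser internals; the tag string and variable names are made up for the illustration.

```python
import re

tags = '<NN><NN><NN>'   # money/NN market/NN fund/NN, tags only

# Exactly two nouns: the leftmost pair is consumed, leaving one noun over.
two_nouns = [m.group() for m in re.finditer(r'(?:<NN>){2}', tags)]

# One or more nouns: a single match covers the whole run.
one_or_more = [m.group() for m in re.finditer(r'(?:<NN>)+', tags)]

print(two_nouns)
print(one_or_more)
```

The third noun is left out of the first result because the scan resumes after the end of the previous match, where only one noun remains.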


We have added a comment to each of our chunk rules. These are optional; when they are present, the chunker prints these comments as part of its tracing output.

Exploring Text Corpora

In Section 5.2 we saw how we could interrogate a tagged corpus to extract phrases matching a particular sequence of part-of-speech tags. We can do the same work more easily with a chunker, as follows:

>>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
>>> brown = nltk.corpus.brown
>>> for sent in brown.tagged_sents():
...     tree = cp.parse(sent)
...     for subtree in tree.subtrees():
...         # NLTK 3 replaced the .node attribute with the .label() method
...         if subtree.node == 'CHUNK': print subtree
...


(CHUNK combined/VBN to/TO achieve/VB)

(CHUNK continue/VB to/TO place/VB)

(CHUNK serve/VB to/TO protect/VB)

(CHUNK wanted/VBD to/TO wait/VB)

(CHUNK allowed/VBN to/TO place/VB)

(CHUNK expected/VBN to/TO become/VB)


(CHUNK seems/VBZ to/TO overtake/VB)

(CHUNK want/VB to/TO buy/VB)


Your Turn: Encapsulate the above example inside a function find_chunks() that takes a chunk string like "CHUNK: {<V.*> <TO> <V.*>}" as an argument. Use it to search the corpus for several other patterns, such as four or more nouns in a row, e.g. "NOUNS: {<N.*>{4,}}"


Sometimes it is easier to define what we want to exclude from a chunk. We can define a chink to be a sequence of tokens that is not included in a chunk. In the following example, barked/VBD at/IN is a chink:

[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]

Chinking is the process of removing a sequence of tokens from a chunk. If the matching sequence of tokens spans an entire chunk, then the whole chunk is removed; if the sequence of tokens appears in the middle of the chunk, these tokens are removed, leaving two chunks where there was only one before. If the sequence is at the periphery of the chunk, these tokens are removed, and a smaller chunk remains. These three possibilities are illustrated in Table 7.3.

             Entire chunk               Middle of a chunk            End of a chunk
Input        [a/DT little/JJ dog/NN]    [a/DT little/JJ dog/NN]      [a/DT little/JJ dog/NN]
Operation    Chink "DT JJ NN"           Chink "JJ"                   Chink "NN"
Output       a/DT little/JJ dog/NN      [a/DT] little/JJ [dog/NN]    [a/DT little/JJ] dog/NN

In Example 7.5, we put the entire sentence into a single chunk, then excise the chinks.

grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
  """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)

>>> print cp.parse(sentence)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))

Example 7.5 (code_chinker.py): Figure 7.5: Simple Chinker
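The chinking idea can be sketched without NLTK: start with every token in one chunk, then cut out the tokens whose tags belong to the chink. The function chink below is a hypothetical helper written for this note; it takes a set of tags rather than a full tag pattern, and it returns chunks as lists and chinked tokens as plain tuples.

```python
def chink(tagged, chink_tags):
    """Split one all-inclusive chunk wherever a token's tag is in chink_tags."""
    result, current = [], []
    for word, tag in tagged:
        if tag in chink_tags:
            if current:                 # close the chunk built so far
                result.append(current)
                current = []
            result.append((word, tag))  # the chinked token stays outside
        else:
            current.append((word, tag))
    if current:
        result.append(current)
    return result


sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
for item in chink(sentence, {"VBD", "IN"}):
    print(item)
```

Applied to [a/DT little/JJ dog/NN], the same function reproduces the three cases described above: chinking every tag removes the chunk, chinking JJ splits it in two, and chinking NN leaves a smaller chunk.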

Representing Chunks: Tags vs Trees

As befits their intermediate status between tagging and parsing (Chapter 8), chunk structures can be represented using either tags or trees. The most widespread file representation uses IOB tags. In this scheme, each token is tagged with one of three special chunk tags, I (inside), O (outside), or B (begin). A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O. An example of this scheme is shown in Figure 7.6.


Figure 7.6: Tag Representation of Chunk Structures

IOB tags have become the standard way to represent chunk structures in files, and we will also be using this format. Here is how the information in Figure 7.6 would appear in a file:


saw VBD O

the DT B-NP

little JJ I-NP

yellow JJ I-NP

dog NN I-NP

In this representation there is one token per line, each with its part-of-speech tag and chunk tag. This format permits us to represent more than one chunk type, so long as the chunks do not overlap. As we saw earlier, chunk structures can also be represented using trees. These have the benefit that each chunk is a constituent that can be manipulated directly. An example is shown in Figure 7.7.


Figure 7.7: Tree Representation of Chunk Structures


NLTK uses trees for its internal representation of chunks, but provides methods for reading and writing such trees to the IOB format.
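The correspondence between the two representations can be sketched in plain Python. The functions chunks_to_iob and iob_to_chunks are hypothetical helpers, not NLTK's own conversion routines; they assume a single chunk type and model a chunk as a list of (word, tag) pairs rather than a tree.

```python
def chunks_to_iob(items, label='NP'):
    """Flatten a mix of tokens and chunks into (word, tag, iob) rows."""
    rows = []
    for item in items:
        if isinstance(item, list):                       # a chunk
            for i, (word, tag) in enumerate(item):
                prefix = 'B-' if i == 0 else 'I-'
                rows.append((word, tag, prefix + label))
        else:                                            # a token outside any chunk
            word, tag = item
            rows.append((word, tag, 'O'))
    return rows


def iob_to_chunks(rows):
    """Rebuild the token/chunk sequence from (word, tag, iob) rows."""
    items = []
    for word, tag, iob in rows:
        if iob.startswith('B-'):
            items.append([(word, tag)])                  # open a new chunk
        elif iob.startswith('I-') and items and isinstance(items[-1], list):
            items[-1].append((word, tag))                # continue the open chunk
        else:
            items.append((word, tag))                    # outside any chunk
    return items


rows = [('saw', 'VBD', 'O'), ('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'),
        ('yellow', 'JJ', 'I-NP'), ('dog', 'NN', 'I-NP')]
items = iob_to_chunks(rows)
print(items)
print(chunks_to_iob(items) == rows)
```

The B tag is what makes the format unambiguous: two adjacent chunks of the same type can be told apart because the second one starts with B rather than I.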

Part 3: Python Natural Language Processing Study Notes (9): 2.1 Accessing Text Corpora


Updated 1st 2011.8.6


Accessing Text Corpora and Lexical Resources


Practical work in Natural Language Processing typically uses large bodies of linguistic data, or corpora. The goal of this chapter is to answer the following questions:

1. What are some useful text corpora and lexical resources, and how can we access them with Python?


2. Which Python constructs are most helpful for this work?


3. How do we avoid repeating ourselves when writing Python code?


This chapter continues to present programming concepts by example, in the context of a linguistic processing task. We will wait until later before exploring each Python construct systematically. Don’t worry if you see an example that contains something unfamiliar; simply try it out and see what it does, and, if you’re game, modify it by substituting some part of the code with a different text or word. This way you will associate a task with a programming idiom, and learn the hows and whys later.

2.1 Accessing Text Corpora

As just mentioned, a text corpus is a large body of text. Many corpora are designed to contain a careful balance of material in one or more genres. We examined some small text collections in Chapter 1, such as the speeches known as the US Presidential Inaugural Addresses. This particular corpus actually contains dozens of individual texts (one per address), but for convenience we glued them end-to-end and treated them as a single text. Chapter 1 also used various predefined texts that we accessed by typing from nltk.book import *. However, since we want to be able to work with other texts, this section examines a variety of text corpora. We’ll see how to select individual texts, and how to work with them.

Gutenberg Corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books (36,000 by the time of these notes), hosted at http://www.gutenberg.org/. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:

 >>> import nltk

 >>> nltk.corpus.gutenberg.fileids()

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt','chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Let’s pick out the first of these texts—Emma by Jane Austen—and give it a short name, emma, then find out how many words it contains:

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')

>>> len(emma)


In Section 1.1, we showed how you could carry out concordancing of a text such as text1 with the command text1.concordance(). However, this assumes that you are using one of the nine texts obtained as a result of doing from nltk.book import *. Now that you have started examining data from nltk.corpus, as in the previous example, you have to employ the following pair of statements to perform concordancing and other tasks from Section 1.1:

>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))

 >>> emma.concordance("surprize")

When we defined emma, we invoked the words() function of the gutenberg object in NLTK’s corpus package. But since it is cumbersome to type such long names all the time, Python provides another version of the import statement, as follows:

>>> from nltk.corpus import gutenberg

>>> gutenberg.fileids()

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]

>>> emma = gutenberg.words('austen-emma.txt')

Let’s write a short program to display other information about each text, by looping over all the values of fileid corresponding to the gutenberg file identifiers listed earlier and then computing statistics for each text. For a compact output display, we will make sure that the numbers are all integers, using int().

>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid))   # ①
...     num_words = len(gutenberg.words(fileid))
...     num_sents = len(gutenberg.sents(fileid))
...     num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
...     print int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid
...


4 21 26 austen-emma.txt

4 23 16 austen-persuasion.txt

4 24 22 austen-sense.txt

4 33 79 bible-kjv.txt

4 18 5 blake-poems.txt

4 17 14 bryant-stories.txt

4 17 12 burgess-busterbrown.txt

4 16 12 carroll-alice.txt

4 17 11 chesterton-ball.txt

4 19 11 chesterton-brown.txt

4 16 10 chesterton-thursday.txt

4 18 24 edgeworth-parents.txt

4 24 15 melville-moby_dick.txt

4 52 10 milton-paradise.txt

4 12 8 shakespeare-caesar.txt

4 13 7 shakespeare-hamlet.txt

4 13 6 shakespeare-macbeth.txt

4 35 12 whitman-leaves.txt

This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score). Observe that average word length appears to be a general property of English, since it has a recurrent value of 4. (In fact, the average word length is really 3, not 4, since the num_chars variable counts space characters.) By contrast average sentence length and lexical diversity appear to be characteristics of particular authors.
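The three ratios can be reproduced on any string without NLTK. The sketch below is an approximation under stated assumptions: text_stats is a hypothetical helper, and its regex-based tokenizer and sentence splitter are crude stand-ins for the corpus reader's words() and sents() methods.

```python
import re


def text_stats(raw):
    """Return (chars per word, words per sentence, words per vocabulary item)."""
    # Split after sentence-final punctuation followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', raw.strip()) if s]
    # Words and punctuation marks both count as tokens.
    words = re.findall(r"\w+|[^\w\s]", raw)
    vocab = set(w.lower() for w in words)
    # len(raw) counts spaces too, which inflates the first ratio.
    return (len(raw) // len(words),
            len(words) // len(sentences),
            len(words) // len(vocab))


sample = "The dog barked. The cat ran. The dog ran."
print(text_stats(sample))
```

Integer division plays the role of int(a/b) in the Python 2 listing, where dividing two integers already truncates.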

The previous example also showed how we can access the “raw” text of the book①, not split up into tokens. The raw() function gives us the contents of the file without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, including the spaces between words. The sents() function divides the text up into its sentences, where each sentence is a list of words:


>>> macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')

>>> macbeth_sentences

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]

>>> macbeth_sentences[1037]

['Good', 'night', ',', 'and', 'better', 'health', 'Attend', 'his', 'Maiesty']

>>> longest_len = max([len(s) for s in macbeth_sentences])

>>> [s for s in macbeth_sentences if len(s) == longest_len]

[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', ...], ...]

Most NLTK corpus readers include a variety of access methods apart from words(), raw(), and sents(). Richer linguistic content is available from some corpora, such as part-of-speech tags, dialogue tags, syntactic trees, and so forth; we will see these in later chapters.

Web and Chat Text

Although Project Gutenberg contains thousands of books, it represents established literature. It is important to consider less formal language as well. NLTK’s small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews:

>>> from nltk.corpus import webtext

>>> for fileid in webtext.fileids():

...     print fileid, webtext.raw(fileid)[:65], '...'



firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se...

grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there!  [clop...

overheard.txt White guy: So, do you have any plans for this evening? Asian girl...

pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr...

singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun...

wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb...

There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators. The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form “UserNNN”, and manually edited to remove any other identifying information. The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom). The filename contains the date, chat-room, and number of posts; e.g., 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on 10/19/2006.

>>> from nltk.corpus import nps_chat

>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')

>>> chatroom[123]

['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',','I', 'can', 'look', 'in', 'a', 'mirror','.']

Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. Table 2-1 gives an example of each genre (for a complete list, see).

ID   File  Genre            Description
A16  ca16  news             Chicago Tribune: Society Reportage
B02  cb02  editorial        Christian Science Monitor: Editorials
C17  cc17  reviews          Time Magazine: Reviews
D12  cd12  religion         Underwood: Probing the Ethics of Realtors
E36  ce36  hobbies          Norling: Renting a Car in Europe
F25  cf25  lore             Boroff: Jewish Teenage Culture
G22  cg22  belles_lettres   Reiner: Coping with Runaway Technology
H15  ch15  government       US Office of Civil and Defence Mobilization: The Family Fallout Shelter
J17  cj19  learned          Mosteller: Probability with Statistical Applications
K04  ck04  fiction          W.E.B. Du Bois: Worlds of Color
L13  cl13  mystery          Hitchens: Footsteps in the Night
M01  cm01  science_fiction  Heinlein: Stranger in a Strange Land
N14  cn15  adventure        Field: Rattlesnake Ridge
P12  cp12  romance          Callaghan: A Passion in Rome
R06  cr06  humor            Thurber: The Future, If Any, of Comedy

Table 2-1. Example document for each section of the Brown Corpus

We can access the corpus as a list of words or a list of sentences (where each sentence is itself just a list of words). We can optionally specify particular categories or files to read:

>>> from nltk.corpus import brown

>>> brown.categories()

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']


>>> brown.words(categories='news')

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

>>> brown.words(fileids=['cg22'])

['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]

>>> brown.sents(categories=['news', 'editorial', 'reviews'])

[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. Let’s compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre. Remember to import nltk before doing the following:

>>> from nltk.corpus import brown

>>> news_text = brown.words(categories='news')

>>> fdist = nltk.FreqDist([w.lower() for w in news_text])

>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']

>>> for m in modals:

...     print m + ':', fdist[m],


can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Your Turn: Choose a different section of the Brown Corpus, and adapt the preceding example to count a selection of wh words, such as what, when, where, who and why.

Next, we need to obtain counts for each genre of interest. We’ll use NLTK’s support for conditional frequency distributions. These are presented systematically in Section 2.2, where we also unpick the following code line by line. For the moment, you can ignore the details and just concentrate on the output.

>>> cfd = nltk.ConditionalFreqDist(

...           (genre, word)

...           for genre in brown.categories()

...           for word in brown.words(categories=genre))

>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']

>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']

>>> cfd.tabulate(conditions=genres, samples=modals)

                   can could   may might  must  will
           news     93    86    66    38    50   389
       religion     82    59    78    12    54    71
        hobbies    268    58   131    22    83   264
science_fiction     16    49     4    12     8    16
        romance     74   193    11    51    45    43
          humor     16    30     8     8     9    13

Observe that the most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could. Would you have predicted this? The idea that word counts might distinguish genres will be taken up again in Chapter 6.
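A conditional frequency distribution is, in essence, one counter per condition. The sketch below imitates the idea with the standard library on two made-up sentences; conditional_freq and tabulate are hypothetical helpers, not NLTK's ConditionalFreqDist, and the toy word lists are invented for the illustration.

```python
from collections import Counter, defaultdict


def conditional_freq(pairs):
    """Count samples separately for each condition in (condition, sample) pairs."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd


def tabulate(cfd, conditions, samples):
    """Format the selected counts as a small right-aligned text table."""
    width = max(len(c) for c in conditions)
    lines = [' ' * width + ''.join('%6s' % s for s in samples)]
    for c in conditions:
        lines.append(c.rjust(width) + ''.join('%6d' % cfd[c][s] for s in samples))
    return '\n'.join(lines)


# Toy "genres" standing in for brown.words(categories=genre).
toy = {
    'news': 'they will say it can wait but it will not'.split(),
    'romance': 'she could not see how he could stay'.split(),
}
pairs = ((genre, word) for genre in toy for word in toy[genre])
cfd = conditional_freq(pairs)
print(tabulate(cfd, ['news', 'romance'], ['can', 'could', 'will']))
```

A Counter returns 0 for a sample it has never seen, so the table can include modals that do not occur in a given genre.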

Reuters Corpus

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called “training” and “test”; thus, the text with fileid 'test/14826' is a document drawn from the test set. This split is for training and testing algorithms that automatically detect the topic of a document, as we will see in Chapter 6.

>>> from nltk.corpus import reuters

>>> reuters.fileids()

['test/14826', 'test/14828', 'test/14829', 'test/14832', ...]

>>> reuters.categories()

['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa',

'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn',

'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', ...]

Unlike the Brown Corpus, categories in the Reuters Corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories. For convenience, the corpus methods accept a single fileid or a list of fileids.

>>> reuters.categories('training/9865')

['barley', 'corn', 'grain', 'wheat']

>>> reuters.categories(['training/9865', 'training/9880'])

['barley', 'corn', 'grain', 'money-fx', 'wheat']

>>> reuters.fileids('barley')

['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', ...]

>>> reuters.fileids(['barley', 'corn'])

['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106',

'test/15287', 'test/15341', 'test/15618', 'test/15618', 'test/15648', ...]
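Overlapping categories amount to a many-to-many mapping between documents and topics, which can be queried in either direction. The sketch below models this with a dictionary of sets. It is an assumption-laden toy: the functions only mimic the calling style of the corpus reader, and the topic set given for 'training/9880' is a guess consistent with the output above (which shows only that it adds money-fx).

```python
# Topics per document; the entry for 'training/9880' is assumed for illustration.
doc_topics = {
    'training/9865': {'barley', 'corn', 'grain', 'wheat'},
    'training/9880': {'money-fx'},
}


def categories(fileids):
    """Topics covered by one fileid or by any fileid in a list."""
    if isinstance(fileids, str):
        fileids = [fileids]
    return sorted(set().union(*(doc_topics[f] for f in fileids)))


def fileids(cats):
    """Documents filed under one category or under any category in a list."""
    if isinstance(cats, str):
        cats = [cats]
    wanted = set(cats)
    return sorted(f for f, topics in doc_topics.items() if topics & wanted)


print(categories('training/9865'))
print(categories(['training/9865', 'training/9880']))
print(fileids(['barley', 'money-fx']))
```

Accepting either a single string or a list in both functions mirrors the convenience described above for the corpus methods.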

Similarly, we can specify the words or sentences we want in terms of files or categories. The first handful of words in each of these texts are the titles, which by convention are stored as uppercase.

>>> reuters.words('training/9865')[:14]


'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']

>>> reuters.words(['training/9865', 'training/9880'])


>>> reuters.words(categories='barley')


>>> reuters.words(categories=['barley', 'corn'])

['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]

Inaugural Address Corpus

In Section 1.1, we looked at the Inaugural Address Corpus, but treated it as a single text. The graph in Figure 1-2 used “word offset” as one of the axes; this is the numerical index of the word in the corpus, counting from the first word of the first address. However, the corpus is actually a collection of 55 texts, one for each presidential address. An interesting property of this collection is its time dimension (Obama's address is included as well):

>>> from nltk.corpus import inaugural

>>> inaugural.fileids()   

['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]

>>> [fileid[:4] for fileid in inaugural.fileids()]

['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', ...]

Notice that the year of each text appears in its filename. To get the year out of the filename, we extracted the first four characters, using fileid[:4].
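As a small illustration of the slicing step, the sketch below applies fileid[:4] to three of the fileids shown above. The second slice, which pulls out the name, assumes the 'YEAR-Name.txt' naming scheme and is an addition for illustration only.

```python
fileids = ['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt']

# fileid[:4] keeps characters 0..3, i.e. the four-digit year.
years = [fileid[:4] for fileid in fileids]

# Assumed naming scheme 'YEAR-Name.txt': skip the year and dash, drop '.txt'.
names = [fileid[5:-4] for fileid in fileids]

print(years)
print(names)
```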

Let’s look at how the words America and citizen are used over time. The following code converts the words in the Inaugural corpus to lowercase using w.lower()①, then checks whether they start with either of the “targets” america or citizen using startswith()①. Thus it will count words such as American’s and Citizens. We’ll learn about conditional frequency distributions in Section 2.2; for now, just consider the output, shown in Figure 2-1.

>>> cfd = nltk.ConditionalFreqDist(
...           (target, fileid[:4])
...           for fileid in inaugural.fileids()
...           for w in inaugural.words(fileid)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target)) ①
>>> cfd.plot()

Note: if the pair is written as (target, file[:4]) instead of (target, fileid[:4]), Python 2 raises the error below, because file is the built-in file type and a type cannot be sliced:

Traceback (most recent call last):
  File "E:/Test/NLTK/2.1.py", line 6, in <module>
    for fileid in inaugural.fileids()
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 1740, in __init__
    for (cond, sample) in cond_samples:
  File "E:/Test/NLTK/2.1.py", line 9, in <genexpr>
    if w.lower().startswith(target))
TypeError: 'type' object is unsubscriptable

Figure 2-1. Plot of a conditional frequency distribution: All words in the Inaugural Address Corpus that begin with america or citizen are counted; separate counts are kept for each address; these are plotted so that trends in usage over time can be observed; counts are not normalized for document length.
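To make the counting behind the plot concrete, here is a pure-Python sketch that tallies (target, year) pairs with collections.Counter. It is not NLTK's ConditionalFreqDist, and the two miniature "addresses" in texts are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus: fileid -> list of words.
texts = {
    '1789-A.txt': ['Citizens', 'of', 'America', ',', 'fellow', 'citizens'],
    '1793-B.txt': ['American', 'citizens', 'and', "America's", 'friends'],
}

targets = ['america', 'citizen']

# condition (target) -> Counter of samples (years)
cfd = defaultdict(Counter)
for fileid, words in texts.items():
    year = fileid[:4]
    for w in words:
        for target in targets:
            if w.lower().startswith(target):
                cfd[target][year] += 1

for target in targets:
    print(target, sorted(cfd[target].items()))
```

As in the caption, these are raw counts per address and are not normalized for document length.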

Annotated Text Corpora

Many text corpora contain linguistic annotations, representing part-of-speech tags, named entities, syntactic structures, semantic roles, and so forth. NLTK provides convenient ways to access several of these corpora, and has data packages containing corpora and corpus samples, freely downloadable for use in teaching and research. Table 2-2 lists some of the corpora. For information about downloading them, see http://www.nltk.org/data. For more examples of how to access NLTK corpora, please consult the Corpus HOWTO at http://www.nltk.org/howto.

Table 2-2. Some of the corpora and corpus samples distributed with NLTK

Corpus | Compiler | Contents
Brown Corpus | Francis, Kucera | 15 genres, 1.15M words, tagged, categorized
CESS Treebanks | CLiC-UB | 1M words, tagged and parsed (Catalan, Spanish)
Chat-80 Data Files | Pereira & Warren | World Geographic Database
CMU Pronouncing Dictionary | CMU | 127k entries
CoNLL 2000 Chunking Data | CoNLL | 270k words, tagged and chunked
CoNLL 2002 Named Entity | CoNLL | 700k words, pos- and named-entity-tagged (Dutch, Spanish)
CoNLL 2007 Dependency Treebanks (sel) | CoNLL | 150k words, dependency parsed (Basque, Catalan)
Dependency Treebank | Narad | Dependency parsed version of Penn Treebank sample
Floresta Treebank | Diana Santos et al | 9k sentences, tagged and parsed (Portuguese)
Gazetteer Lists | Various | Lists of cities and countries
Genesis Corpus | Misc web sources | 6 texts, 200k words, 6 languages
Gutenberg (selections) | Hart, Newby, et al | 18 texts, 2M words
Inaugural Address Corpus | CSpan | US Presidential Inaugural Addresses (1789-present)
Indian POS-Tagged Corpus | Kumaran et al | 60k words, tagged (Bangla, Hindi, Marathi, Telugu)
MacMorpho Corpus | NILC, USP, Brazil | 1M words, tagged (Brazilian Portuguese)
Movie Reviews | Pang, Lee | 2k movie reviews with sentiment polarity classification
Names Corpus | Kantrowitz, Ross | 8k male and female names
NIST 1999 Info Extr (selections) | Garofolo | 63k words, newswire and named-entity SGML markup
NPS Chat Corpus | Forsyth, Martell | 10k IM chat posts, POS-tagged and dialogue-act tagged
PP Attachment Corpus | Ratnaparkhi | 28k prepositional phrases, tagged as noun or verb modifiers
Proposition Bank | Palmer | 113k propositions, 3300 verb frames
Question Classification | Li, Roth | 6k questions, categorized
Reuters Corpus | Reuters | 1.3M words, 10k news documents, categorized
Roget's Thesaurus | Project Gutenberg | 200k words, formatted text
RTE Textual Entailment | Dagan et al | 8k sentence pairs, categorized
SEMCOR | Rus, Mihalcea | 880k words, part-of-speech and sense tagged
Senseval 2 Corpus | Pedersen | 600k words, part-of-speech and sense tagged
Shakespeare texts (selections) | Bosak | 8 books in XML format
State of the Union Corpus | CSPAN | 485k words, formatted text
Stopwords Corpus | Porter et al | 2,400 stopwords for 11 languages
Swadesh Corpus | Wiktionary | comparative wordlists in 24 languages
Switchboard Corpus (selections) | LDC | 36 phonecalls, transcribed, parsed
Univ Decl of Human Rights | United Nations | 480k words, 300+ languages
Penn Treebank (selections) | LDC | 40k words, tagged and parsed
TIMIT Corpus (selections) | NIST/LDC | audio files and transcripts for 16 speakers
VerbNet 2.1 | Palmer et al | 5k verbs, hierarchically organized, linked to WordNet
Wordlist Corpus | OpenOffice.org et al | 960k words and 20k affixes for 8 languages
WordNet 3.0 (English) | Miller, Fellbaum | 145k synonym sets

Corpora in Other Languages

NLTK comes with corpora for many languages, though in some cases you will need to learn how to manipulate character encodings in Python before using these corpora (see Section 3.3).

>>> nltk.corpus.cess_esp.words()

['El', 'grupo', 'estatal', 'Electricit\xe9_de_France', ...]

>>> nltk.corpus.floresta.words()

['Um', 'revivalismo', 'refrescante', 'O', '7_e_Meio', ...]

>>> nltk.corpus.indian.words('hindi.pos')



\x82\xe0\xa4\xa7', ...]

>>> nltk.corpus.udhr.fileids()

['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1',

'Adja-UTF8', 'Afaan_Oromo_Oromiffa-Latin1', 'Afrikaans-Latin1', 'Aguaruna-Latin1',

'Akuapem_Twi-UTF8', 'Albanian_Shqip-Latin1', 'Amahuaca', 'Amahuaca-Latin1', ...]

>>> nltk.corpus.udhr.words('Javanese-Latin1')[11:]

[u'Saben', u'umat', u'manungsa', u'lair', u'kanthi', ...]

The last of these corpora, udhr, contains the Universal Declaration of Human Rights in over 300 languages. The fileids for this corpus include information about the character encoding used in the file, such as UTF8 or Latin1. Let’s use a conditional frequency distribution to examine the differences in word lengths for a selection of languages included in the udhr corpus. The output is shown in Figure 2-2 (run the program yourself to see a color plot). Note that True and False are Python’s built-in Boolean values.

>>> from nltk.corpus import udhr

>>> languages = ['Chickasaw', 'English', 'German_Deutsch',

...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']

>>> cfd = nltk.ConditionalFreqDist(

...           (lang, len(word))

...           for lang in languages

...           for word in udhr.words(lang + '-Latin1'))

>>> cfd.plot(cumulative=True)


Figure 2-2. Cumulative word length distributions: Six translations of the Universal Declaration of Human Rights are processed; this graph shows that words having five or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.
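The cumulative curve in Figure 2-2 can be read as "the share of words with at most n letters". A minimal sketch of that calculation follows, using a short made-up word list rather than the udhr corpus. The function name cumulative_length_share is hypothetical.

```python
from collections import Counter

def cumulative_length_share(words):
    """Return {length: share of words with at most that many letters}."""
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    running = 0
    shares = {}
    for length in sorted(counts):
        running += counts[length]
        shares[length] = running / total
    return shares

# Hypothetical sample, not taken from the udhr corpus.
sample = ['all', 'human', 'beings', 'are', 'born', 'free', 'and', 'equal']
shares = cumulative_length_share(sample)
print(shares)
```

In this sample, shares[5] is the fraction of words with five or fewer letters, which is the quantity the caption quotes for each language.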

Your Turn: Pick a language of interest in udhr.fileids(), and define a variable raw_text = udhr.raw(Language-Latin1). Now plot a frequency distribution of the letters of the text using nltk.FreqDist(raw_text).plot().



Unfortunately, for many languages, substantial corpora are not yet available. Often there is insufficient government or industrial support for developing language resources, and individual efforts are piecemeal and hard to discover or reuse. Some languages have no established writing system, or are endangered. (See Section 2.7 for suggestions on how to locate language resources.)

Text Corpus Structure

We have seen a variety of corpus structures so far; these are summarized in Figure 2-3. The simplest kind lacks any structure: it is just a collection of texts. Often, texts are grouped into categories that might correspond to genre, source, author, language, etc. Sometimes these categories overlap, notably in the case of topical categories, as a text can be relevant to more than one topic. Occasionally, text collections have temporal structure, news collections being the most common example. NLTK’s corpus readers support efficient access to a variety of corpora, and can be used to work with new corpora. Table 2-3 lists functionality provided by the corpus readers.


Figure 2-3. Common structures for text corpora: The simplest kind of corpus is a collection of isolated texts with no particular organization; some corpora are structured into categories, such as genre (Brown Corpus); some categorizations overlap, such as topic categories (Reuters Corpus); other corpora represent language use over time (Inaugural Address Corpus). (Four different types of corpus.)

Table 2-3. Basic corpus functionality defined in NLTK: More documentation can be found using help(nltk.corpus.reader) and by reading the online Corpus HOWTO at http://www.nltk.org/howto.

Example | Description
fileids() | The files of the corpus
fileids([categories]) | The files of the corpus corresponding to these categories
categories() | The categories of the corpus
categories([fileids]) | The categories of the corpus corresponding to these files
raw() | The raw content of the corpus
raw(fileids=[f1,f2,f3]) | The raw content of the specified files
raw(categories=[c1,c2]) | The raw content of the specified categories
words() | The words of the whole corpus
words(fileids=[f1,f2,f3]) | The words of the specified fileids
words(categories=[c1,c2]) | The words of the specified categories
sents() | The sentences of the whole corpus
sents(fileids=[f1,f2,f3]) | The sentences of the specified fileids
sents(categories=[c1,c2]) | The sentences of the specified categories
abspath(fileid) | The location of the given file on disk
encoding(fileid) | The encoding of the file (if known)
open(fileid) | Open a stream for reading the given corpus file
root() | The path to the root of locally installed corpus
readme() | The contents of the README file of the corpus

We illustrate the difference between some of the corpus access methods here:

>>> from nltk.corpus import gutenberg
>>> raw = gutenberg.raw("burgess-busterbrown.txt")
>>> raw[1:20]     # sliced by individual characters
'The Adventures of B'
>>> words = gutenberg.words("burgess-busterbrown.txt")
>>> words[1:20]        # sliced by tokens: words, punctuation marks, and numbers
['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.',
'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster',
'Bear']
>>> sents = gutenberg.sents("burgess-busterbrown.txt")
>>> sents[1:20]  # sliced by sentences; 'I' is presumably the chapter numeral on its own line, so it counts as a sentence by itself
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as',
'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched',
'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', ...], ...]
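The three access levels differ only in what one index position means: a character, a token, or a whole sentence. The sketch below mimics that on a short made-up string with the re module. The naive split after each full stop is an assumption for illustration. NLTK's corpus readers use their own tokenizers.

```python
import re

raw = "Buster Bear yawned. He lay on his bed."

# Word level: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", raw)

# Sentence level: naive split on whitespace that follows a full stop.
sents = [re.findall(r"\w+|[^\w\s]", s) for s in re.split(r"(?<=\.)\s+", raw)]

print(raw[0:6])      # slicing a string counts characters
print(words[0:3])    # slicing a word list counts tokens
print(sents[0])      # each sentence is itself a list of tokens
```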

Loading Your Own Corpus

If you have your own collection of text files that you would like to access using the methods discussed earlier, you can easily load them with the help of NLTK’s PlaintextCorpusReader. Check the location of your files on your file system; in the following example, we have taken this to be the directory /usr/share/dict (presumably a Linux path). Whatever the location, set this to be the value of corpus_root①. The second parameter of the PlaintextCorpusReader initializer② can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern that matches all fileids, like '[abc]/.*\.txt' (see Section 3.4 for information about regular expressions).

>>> from nltk.corpus import PlaintextCorpusReader

>>> corpus_root = '/usr/share/dict' ①

>>> wordlists = PlaintextCorpusReader(corpus_root, '.*') ②

>>> wordlists.fileids()

['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']

>>> wordlists.words('connectives')

['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
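To show how a pattern such as '[abc]/.*\.txt' picks out fileids, here is a pure-Python sketch using re.fullmatch on a made-up list of relative paths. It only illustrates the regular-expression idea. NLTK's own matching may differ in detail, and select_fileids is a hypothetical helper, not an NLTK function.

```python
import re

def select_fileids(relative_paths, pattern):
    """Keep the paths that the regular expression matches in full."""
    regex = re.compile(pattern)
    return sorted(p for p in relative_paths if regex.fullmatch(p))

# Hypothetical paths relative to a corpus root, written with forward slashes.
paths = ['README', 'a/one.txt', 'b/two.txt', 'd/three.txt', 'a/notes.md']

print(select_fileids(paths, r'.*'))
print(select_fileids(paths, r'[abc]/.*\.txt'))
```

The first pattern keeps every path. The second keeps only .txt files directly under a directory named a, b, or c.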

As another example, suppose you have your own local copy of Penn Treebank (release 3), in C:\corpora. We can use the BracketParseCorpusReader to access this corpus. We specify the corpus_root to be the location of the parsed Wall Street Journal component of the corpus①, and give a file_pattern that matches the files contained within its subfolders② (using forward slashes).

>>> from nltk.corpus import BracketParseCorpusReader

>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj" ①

>>> file_pattern = r".*/wsj_.*\.mrg"  ②

>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)

>>> ptb.fileids()

['00/wsj_0001.mrg','00/wsj_0002.mrg', '00/wsj_0003.mrg', '00/wsj_0004.mrg', ...]

>>> len(ptb.sents())


>>> ptb.sents(fileids='20/wsj_2013.mrg')[19]

['The', '55-year-old', 'Mr.', 'Noriega', 'is', "n't", 'as', 'smooth', 'as', 'the',

'shah', 'of', 'Iran', ',', 'as', 'well-born', 'as', 'Nicaragua', "'s", 'Anastasio',

'Somoza', ',', 'as', 'imperial', 'as', 'Ferdinand', 'Marcos', 'of', 'the', 'Philippines',

'or', 'as', 'bloody', 'as', 'Haiti', "'s", 'Baby', 'Doc', 'Duvalier', '.']


4 : opencv-python Study Notes 1: Simple Image Processing

When reposting, please credit: @小五义 http://www.cnblogs.com/xiaowuyi  QQ group: 64770604


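1. cv2.imread(): reads an image. It takes two arguments: the first is the file name of the image to read, and the second specifies how to read it. The options include cv2.IMREAD_COLOR, which reads a color image; cv2.IMREAD_GRAYSCALE, which reads the image in grayscale mode; and cv2.IMREAD_UNCHANGED, which reads the image including its alpha channel.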










# -*- coding: utf-8 -*-
"""
@xiaowuyi:http://www.cnblogs.com/xiaowuyi
"""
import cv2

img = cv2.imread('1.jpg', cv2.IMREAD_COLOR)  # read the image in color
cv2.imshow('image', img)  # show the image in a window named 'image'
k = cv2.waitKey(0)  # wait indefinitely for a key press
if k == 27:  # ESC: quit
    cv2.destroyAllWindows()
elif k == ord('s'):  # 's': save the image, then quit
    cv2.imwrite('test.png', img)
    print("OK!")
    cv2.destroyAllWindows()




# -*- coding: utf-8 -*-
"""
@xiaowuyi:http://www.cnblogs.com/xiaowuyi
"""
import cv2

img = cv2.imread('1.jpg', cv2.IMREAD_GRAYSCALE)  # read the image in grayscale
cv2.imshow('image', img)  # show the image in a window named 'image'
k = cv2.waitKey(0)  # wait indefinitely for a key press
if k == 27:  # ESC: quit
    cv2.destroyAllWindows()
elif k == ord('s'):  # 's': save the image, then quit
    cv2.imwrite('test.png', img)
    print("OK!")
    cv2.destroyAllWindows()



5 : Python Systematic Study Notes (3) --- function








def fun(n, m, ...):
    <function body>
    return n    # optional





















# Filename: function1.py
def sayHello():
    print('Hello World!') # block belonging to the function
# End of function

sayHello() # call the function
sayHello() # call the function again



C:\Users\Administrator>python D:\python\function1.py

Hello World!

Hello World!









# Filename: func_param.py
def printMax(a, b):
    if a > b:
        print(a, 'is maximum')
    elif a == b:
        print(a, 'is equal to', b)
    else:
        print(b, 'is maximum')

printMax(3, 4) # directly give literal values

x = 5
y = 7
printMax(x, y) # give variables as arguments



C:\Users\Administrator>python D:\python\func_param.py

4 is maximum

7 is maximum



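In the first call to printMax, we pass the numbers, i.e. the arguments, directly to the function. In the second call, we call the function with variables. printMax(x, y) assigns the value of argument x to parameter a and the value of argument y to parameter b. The printMax function works exactly the same way in both calls.

When you declare variables inside a function definition, they are unrelated to any variables with the same names outside the function; that is, variable names are local to the function. This is called the scope of the variable. The scope of every variable is the block in which it is declared, starting from the point where the name is defined.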



# Filename: func_local.py
x = 50

def func(x):
    print('x is', x)
    x = 2
    print('Changed local x to', x)

func(x)
print('x is still', x)



C:\Users\Administrator>python D:\python\func_local.py

x is 50

Changed local x to 2

x is still 50


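The first time we use the value of x inside the function, Python uses the value of the parameter declared in the function definition. Assigning 2 to x then changes only this local name, so the x defined in the main block keeps its value of 50, as the last line of output shows.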








# Filename: func_global.py
x = 50

def func():
    global x
    print('x is', x)
    x = 2
    print('Changed global x to', x)

func()
print('Value of x is', x)



C:\Users\Administrator>python D:\python\func_global.py

x is 50

Changed global x to 2

Value of x is 2



You can name several global variables in a single global statement, for example global x, y, z.
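A short sketch of a single global statement covering several names. The function names reset_all and shadow_only are made up for this illustration.

```python
x, y, z = 1, 2, 3

def reset_all():
    # One global statement can name several module-level variables.
    global x, y, z
    x, y, z = 0, 0, 0

def shadow_only():
    # Without 'global', this assignment creates a new local x.
    x = 99
    return x

reset_all()
shadow_only()
print(x, y, z)
```

After both calls the module-level values are all 0, because only reset_all declared the names as global.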






# Filename: func_nonlocal.py
def func_outer():
    x = 2
    print('x is', x)

    def func_inner():
        nonlocal x
        x = 5

    func_inner()
    print('Changed local x to', x)

func_outer()




C:\Users\Administrator>python D:\python\func_nonlocal.py

x is 2

Changed local x to 5


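While we are inside func_inner(), the variable x defined in the first line of func_outer() is neither a local variable (it is not in the func_inner block) nor a global variable (it is not in the main program block). We therefore declare nonlocal x to state that we want to use that variable.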








# Filename: func_default.py
def say(message, times = 1):
    print(message * times)

say('Hello')
say('World', 5)

C:\Users\Administrator>python D:\python\func_default.py
Hello
WorldWorldWorldWorldWorld








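Only the parameters at the end of the parameter list can be given default values. This is because values are assigned to parameters by position. For example, def func(a, b=5) is valid, but def func(a=5, b) is not.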
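The rule can be checked directly. In the sketch below, the invalid definition is kept in a string and passed to the built-in compile(), since writing it in the file itself would stop the whole script from loading. The variable names bad_source and compiled are illustrative.

```python
def func(a, b=5):
    return a + b

print(func(1))      # b falls back to its default value
print(func(1, 2))   # b is supplied by position

# The invalid form is rejected at compile time, before anything runs.
bad_source = "def func(a=5, b):\n    return a + b\n"
try:
    compile(bad_source, "<example>", "exec")
    compiled = True
except SyntaxError:
    compiled = False
print(compiled)
```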






# Filename: func_key.py
def func(a, b=5, c=10):
    print('a is', a, 'and b is', b, 'and c is', c)

func(3, 7)
func(25, c=24)
func(c=50, a=100)


C:\Users\Administrator>python D:\python\func_key.py

a is 3 and b is 7 and c is 10

a is 25 and b is 5 and c is 24

a is 100 and b is 5 and c is 50



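In the first call, func(3, 7), parameter a gets the value 3, parameter b gets the value 7, and parameter c uses its default value 10.

In the second call, func(25, c=24), a gets the value 25 because of the position of the argument. Parameter c gets the value 24 by name, that is, as a keyword argument. Parameter b keeps its default value 5.

In the third call, func(c=50, a=100), we use keyword arguments to specify all the values. Note that we can give the value for c before the value for a, even though a is defined before c in the function definition.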





# Filename: total.py
def total(initial=5, *numbers, **keywords):
    count = initial
    for number in numbers:
        count += number
    for key in keywords:
        count += keywords[key]
    return count

print(total(10, 1, 2, 3, vegetables=50, fruits=100))

C:\Users\Administrator>python D:\python\total.py
166









# Filename: keyword_only.py

def total(initial=5, *numbers, extra_number):
    count = initial
    for number in numbers:
        count += number
    count += extra_number
    print(count)

total(10, 1, 2, 3, extra_number=50)
total(10, 1, 2, 3)
# Raises error because we have not supplied a default argument value for 'extra_number'

C:\Users\Administrator>python D:\python\keyword_only.py
66
Traceback (most recent call last):
  File "D:\python\keyword_only.py", line 11, in <module>
    total(10, 1, 2, 3)
TypeError: total() needs keyword-only argument extra_number



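Note that x += y as used here is equivalent to x = x + y. If you need only keyword-only parameters and no starred parameter, you can omit the name of the starred parameter, as in total(initial=5, *, extra_number).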
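A minimal sketch of the bare-star form mentioned above. This simplified total returns the sum instead of printing it, which is a change made here only to keep the example short.

```python
def total(initial=5, *, extra_number):
    # The bare * ends the positional parameters; extra_number must be named.
    return initial + extra_number

print(total(10, extra_number=50))
print(total(extra_number=1))

try:
    total(10, 50)   # a second positional argument is not accepted
except TypeError as err:
    print('TypeError:', err)
```

With the bare star there is no tuple collecting extra positional arguments, so passing one is an error rather than being silently gathered.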





# Filename: func_return.py
def maximum(x, y):
    if x > y:
        return x
    elif x == y:
        return 'The numbers are equal'
    else:
        return y

print(maximum(2, 3))

C:\Users\Administrator>python D:\python\func_return.py
3




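Note that a return statement without a value is equivalent to return None. None is a special value in Python that represents nothing. For example, a variable whose value is None can be said to have no value.

Unless you write your own return statement, every function implicitly ends with return None. You can see this by running print(someFunction()), where someFunction contains no return statement, for example: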


def someFunction():
    pass










# Filename: func_doc.py
def printMax(x, y):
    '''Prints the maximum of two numbers.

    The two values must be integers.'''
    x = int(x) # convert to integers, if possible
    y = int(y)
    if x > y:
        print(x, 'is maximum')
    else:
        print(y, 'is maximum')

printMax(3, 5)



