lecture WIDM Lab Tutorial 2016 Web Intelligence and Data Mining Laboratory Python

(1)

A Simple Python Tutorial

104522102 蔣佳峰

(2)

Outline

1. Introduction to Python

2. Installing the interpreter and PyCharm

3. Basic constructs and code

4. Crawler for PTT

(3)

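Introduction

• Python is an object-oriented, interpreted programming language

• Because it runs through an interpreter, it executes more slowly than Java or C++

• Code blocks are delimited by indentation

• Application areas: web programming, GUI development, operating systems, and more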
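The indentation rule above can be seen in a minimal sketch. The function name classify is made up for illustration; print is called with parentheses and a single argument so the snippet runs under both Python 2.7 and Python 3.

```python
def classify(n):
    # Everything indented under "def" belongs to the function body.
    if n % 2 == 0:
        # One level deeper: runs only when the condition is true.
        label = "even"
    else:
        label = "odd"
    # Back at the function's own level: runs in both cases.
    return label

print(classify(4))   # even
print(classify(7))   # odd
```

There are no braces or end keywords; moving the return line one level deeper would change which block it belongs to.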

(4)

Installing the Interpreter and PyCharm

Go to https://www.python.org/ to download the interpreter: hover over Downloads and download version 2.7.12.

(5)

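Installing the Interpreter and PyCharm

Setting the environment variable:

Win7: System Properties -> Advanced -> Environment Variables -> select Path and click Edit -> append c:\python27\ at the end

Win8 & 10: search for "environment variables" directly; the remaining steps are the same as on Win7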

(6)

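Installing the Interpreter and PyCharm

Go to the website:

https://www.jetbrains.com/pycharm/

Click Download, choose the Windows version, download either the Community or the Professional edition, and follow the prompts to complete the installation.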

(7)

Hello World !

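• In Python, single quotes and double quotes can both be used to write strings

• What follows print does not need to be wrapped in parentheses (Python 2)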
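A small sketch of the two points above. The variable names are illustrative only; each print is given one parenthesized argument, which Python 2.7 accepts as an ordinary print statement and Python 3 accepts as a function call.

```python
single = 'Hello World !'
double = "Hello World !"

# Both spellings produce the same string object value.
print(single == double)      # True

# Picking the other quote style avoids escaping a quote inside the string.
print("It's easy")
print('She said "hi"')
```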

(8)

input&raw_input

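input(): accepts the typed input and evaluates it; if the input is not a string, a number, a variable, or a boolean, an error is raised

raw_input(): always treats the typed input as a string

In effect, input works like this:

def input(prompt):
    return eval(raw_input(prompt))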

(9)

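Simple Code

a=5
while(True):
    print input("Enter:")

Typing B at the prompt makes Python treat B as a variable name; because it has never been declared, an error is raised.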
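The behaviour described above can be reproduced without a keyboard by calling eval directly on the text a user might type. This is a sketch under that assumption: typed stands in for the user's input, and the names result and error_name are illustrative.

```python
a = 5

# Python 2's input() evaluates whatever is typed, so typing "a" yields 5.
typed = "a"
result = eval(typed)
print(result)                # 5

# Typing an undeclared name such as B fails during evaluation.
try:
    eval("B")
    error_name = None
except NameError as err:
    error_name = type(err).__name__
    print("%s: %s" % (error_name, err))
```

Because eval runs arbitrary expressions, raw_input() is the safer choice whenever plain text is all that is needed.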

(10)

List

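A list is a mutable, ordered type

Given two lists s and t and a variable x, the following operations are available:

x in s: whether x is present in list s
x not in s: whether x is absent from list s
s=s+t: concatenate s and t and store the result in s
len(s): return the number of elements in list s
s.append(x): add x to the end of list s
s.pop(i): remove and return the element at index i of list s; with no argument it removes the last element
s[-1]: the last element of list s

in is one of Python's keywords. It tests whether a compound data type contains a given element, that is, whether an object that can hold other objects contains a particular object.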
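The operations listed above, run on small sample lists. The values are arbitrary, and print is called with a single parenthesized argument so the sketch works in both Python 2.7 and Python 3.

```python
s = [1, 2, 3]
t = [4, 5]
x = 3

print(x in s)         # True
print(x not in s)     # False

s = s + t             # s is now [1, 2, 3, 4, 5]
print(len(s))         # 5

s.append(6)           # s is now [1, 2, 3, 4, 5, 6]
last = s.pop()        # no index: removes and returns the last element, 6
first = s.pop(0)      # index 0: removes and returns the first element, 1

print(s)              # [2, 3, 4, 5]
print(s[-1])          # 5
```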

(11)

Simple Code

a=[1,2,3,4,5,6,7,8,9,10]
print a
print a[0]
print a[-1]
print a[-2]
print len(a)
print a[0:9]
print a[0:10:2]
a.append(11)
print a
a.pop(0)
print a

(12)

Simple Code

a=range(0,5)
i=len(a)
print a
for x in range(5):
    print x,
while(i>0):
    print a[i-1]
    i=i-1

(13)

Functions

def function_name(parameter1, parameter2, ...):
    ...
    return ...

If a function finishes without using return to hand back a value, it returns None.
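A short sketch of the None rule. The functions add and greet are hypothetical examples, not part of the original slides.

```python
def add(a, b):
    # An explicit return hands the value back to the caller.
    return a + b

def greet(name):
    # The message is built but never returned.
    message = "Hello, " + name

print(add(2, 3))               # 5
print(greet("WIDM") is None)   # True
```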

(14)

Simple Code

def bigger(a,b):
    if(a>b):
        print "a is bigger than b !"
    elif(a<b):
        print "b is bigger than a !"
    else:
        print "They are the same !"

(15)

Read & Write file

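Use open(filename, mode)

Ex:
a=open('….txt','r')
context=a.read()

There are three ways to read:

read: reads the whole file at once
readline: reads only one line at a time
readlines: similar to read, but returns the file's content as a list with one line per element

There are two ways to write:

write: writes a single string to the file
writelines: the counterpart of readlines; writes the strings in a list to the file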
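The three read calls and two write calls can be compared side by side. This sketch writes to a temporary directory rather than a fixed drive path; the file name demo.txt and the variable names are assumptions made for illustration.

```python
import os
import tempfile

# A throwaway location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as fp:
    fp.write("first line\n")                          # one string
    fp.writelines(["second line\n", "third line\n"])  # a list of strings

with open(path, "r") as fp:
    whole = fp.read()        # the entire file as one string
with open(path, "r") as fp:
    one = fp.readline()      # only the first line
with open(path, "r") as fp:
    lines = fp.readlines()   # a list with one element per line

print(repr(whole))
print(repr(one))
print(lines)
```

Note that writelines does not add newline characters; each string in the list has to carry its own.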

(16)

Read & Write file

with open('D:\\abc.txt','w') as ptr:
    ptr.write('Oh ! I am so handsomeeeeeeeee !\n')
    ptr.write('It was just a dream !')
# no close() call is needed: the with block closes the file on exit

with open('D:\\JohnCena.txt','r') as ptr2:
    while(True):
        context=ptr2.readline()
        if(context==""):    # an empty string means end of file
            break
        else:
            print context,

(17)

Class

A class is used to design the objects you need; a class is a template for objects. In Python, a class is defined with the keyword class, and inside it you can define class attributes, instance attributes, and methods.
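The difference between the three kinds of members can be sketched as follows. The class Student and its values are hypothetical and chosen only for illustration.

```python
class Student(object):
    school = "NCU"                  # class attribute: shared by all instances

    def __init__(self, name):
        self.name = name            # instance attribute: one per object

    def describe(self):             # method
        return self.name + " @ " + self.school

a = Student("Amy")
b = Student("Bob")
print(a.describe())                 # Amy @ NCU

# Changing the class attribute is visible through every instance.
Student.school = "WIDM Lab"
print(b.describe())                 # Bob @ WIDM Lab
```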

(18)

class vector():
    def __init__(self,a=0,b=0):    # constructor for initialization
        self.a=a
        self.b=b

    def plus(self,y):              # method: add vector y to this vector in place
        self.a=self.a+y.a
        self.b=self.b+y.b

    def vector_print(self):        # method: print the vector as <a,b>
        print '<%d,%d>' % (self.a,self.b)

(19)

x=vector()
x.vector_print()
a=vector(5,4)
a.vector_print()
b=vector(5,6)
a.plus(b)
a.vector_print()

(20)

urllib2

urllib2 is used to open URLs and send requests

Typical uses include fetching content from web pages and building automated web crawlers

(21)

Opening a Web Page

import urllib2

web=urllib2.urlopen('http://www.ncu.edu.tw/').read()

print type(web)

print web

(22)

(23)

BeautifulSoup

It can parse HTML content and extract the tags and content you want.

Installation steps:

File -> Settings -> Project -> Project Interpreter -> install (green cross) -> enter bs4 -> Install Package

Reference site:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

(24)

import bs4
import urllib2

web=urllib2.urlopen('http://www.ncu.edu.tw/').read()
print type(web)

result=bs4.BeautifulSoup(web,"html.parser")
print type(result)
print 'title:'+result.title.string
print 'link:'
for link in result.find_all('a'):
    print type(link)
    print(link.get('href'))

(25)

import bs4
import urllib2
import re

web=urllib2.urlopen('https://tw.news.yahoo.com/%E8%A8%B1%E5%A4%9A%E5%9C%8B%E5%A4%96-pokemon-%E7%8E%A9%E5%AE%B6%E6%8A%B1%E6%80%A8-pokemon-%E6%98%AF%E5%90%83%E6%B5%81%E9%87%8F%E6%80%AA%E7%8D%B8-041600115.html').read()

p=re.compile(r'<.*?>')    # matches any single HTML tag
result=bs4.BeautifulSoup(web,"html.parser")
print 'title:'+result.title.string
for link in result.find_all('p'):
    context=str(link)
    print p.sub('',context)    # print each paragraph with its tags stripped

(26)

(27)

(28)

Regular Expression

def remove_html_tags(self, data):    # strip HTML tags
    p = re.compile(r'<.*?>')
    return p.sub('', data)           # remove everything the regular expression matches

Requires import re

re.compile(pattern): builds a rule for recognizing text that looks like pattern

For example:

re.compile('ab+') recognizes ab, abb, abbb, …
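Both patterns mentioned above can be tried on short strings. The sample strings and variable names are made up for illustration; the greedy pattern at the end is added only as a contrast and is not from the slides.

```python
import re

# 'ab+' means one 'a' followed by one or more 'b'.
# match() tests the pattern at the start of the string.
ab_rule = re.compile(r'ab+')
checks = [bool(ab_rule.match(s)) for s in ['ab', 'abbb', 'a', 'ba']]
print(checks)                       # [True, True, False, False]

# '<.*?>' is non-greedy: each match stops at the nearest '>'.
tag_rule = re.compile(r'<.*?>')
html = '<p>Hello <b>PTT</b></p>'
stripped = tag_rule.sub('', html)
print(stripped)                     # Hello PTT

# Without the '?', the match runs from the first '<' to the last '>'.
greedy_rule = re.compile(r'<.*>')
greedy_result = greedy_rule.sub('', html)
print(repr(greedy_result))          # ''
```

The question mark is what keeps the text between tags from being deleted along with them.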

(29)

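Crawler

• A technique that uses HTTP requests to fetch data from the web

• A program that browses the web automatically and efficiently retrieves or updates website content

• Often exploits page structure to collect page data at scale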

(30)

(31)

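Crawler, PTT Edition

The implementation takes advantage of the structured way the PTT web version displays articles and pages, which makes fetching articles simple.

The index page of any board has a URL of this form:

https://www.ptt.cc/bbs/<board name>/index<page number>.html
Ex: https://www.ptt.cc/bbs/Gossiping/index100.html

An article's URL has this form:

https://www.ptt.cc/bbs/<board name>/<article ID>.html
Ex: https://www.ptt.cc/bbs/Gossiping/M.1119233779.A.191.html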
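The two URL patterns above are regular enough to build and take apart with plain string operations. This is a sketch only: the helper names index_url, article_url, and post_id_from_href are hypothetical and are not the functions used in the referenced crawler.

```python
BASE = "https://www.ptt.cc/bbs/"

def index_url(board, page):
    # Index page of a board, e.g. .../Gossiping/index100.html
    return BASE + board + "/index" + str(page) + ".html"

def article_url(board, post_id):
    # Page of a single article, e.g. .../Gossiping/M.1119233779.A.191.html
    return BASE + board + "/" + post_id + ".html"

def post_id_from_href(href):
    # Take the last path segment and drop the 5-character ".html" suffix.
    return href.split("/")[-1][:-5]

print(index_url("Gossiping", 100))
print(article_url("Gossiping", "M.1119233779.A.191"))
print(post_id_from_href("/bbs/Gossiping/M.1119233779.A.191.html"))
```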

(32)

A Bare-Bones PTT Crawler

https://github.com/paulyang0125/bbs_crawler_utility

(33)

Reading a Board's Article List

if (self.useHeader):
    request = urllib2.Request(page_url,headers=headers)
    page = bs4.BeautifulSoup(urllib2.urlopen(request).read())
else:
    page = bs4.BeautifulSoup(urllib2.urlopen(page_url).read())
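The snippet above uses page_url and headers without showing how they are defined. One plausible setup is sketched below; the User-Agent value is a placeholder assumption, not taken from the referenced crawler, and the try/except import exists only so the sketch also runs on Python 3, where urllib2 became urllib.request. No network request is sent.

```python
try:
    import urllib2 as request_lib            # Python 2.7
except ImportError:
    import urllib.request as request_lib     # Python 3

page_url = "https://www.ptt.cc/bbs/Gossiping/index100.html"

# Assumed header: some sites reject requests without a browser-like User-Agent.
headers = {"User-Agent": "Mozilla/5.0"}

# Building the Request object only prepares it; urlopen() would send it.
request = request_lib.Request(page_url, headers=headers)

print(request.get_full_url())
print(request.get_header("User-agent"))      # header names are stored capitalized
```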

(34)

Visit every article on the page

for link in page.find_all(class_="r-ent"):
    # the article ID is the last path segment of the link, minus ".html"
    post_id = link.a.get("href").split("/")[-1][:-5]
    if (self.useHeader):
        request = urllib2.Request(post_url(post_id), headers=headers)
        post = bs4.BeautifulSoup(urllib2.urlopen(request).read())
    else:
        post = bs4.BeautifulSoup(urllib2.urlopen(post_url(post_id)).read())
    with open(post_id+".txt", "w") as contentFile_fp:
        contentFile_fp.write("Title: " + post.title.string.encode("utf-8") + "\n" + "\n")
        contentFile_fp.write(self.remove_html_tags(str(post.find(id="main-container"))))
    # no close() call is needed: the with block closes the file on exit
