Introduction to Python: Let's Try Web Scraping

<!DOCTYPE html>

<html lang="ja">

<head>

<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>

<meta content="text/css" http-equiv="Content-Style-Type"/>

<meta content="text/javascript" http-equiv="Content-Script-Type"/>

<meta content="天気予報,台風,地震,花粉,熱中症,豪雨,積雪" name="keywords"/>

・
・
(omitted)
・
・
<!-- /天気関連ニュース -->

<div class="tracked_mods" id="ad-sqb">

<!-- SpaceID=0 robot -->

</div>

</div>

<div id="sub">

<div class="tracked_mods" id="tsunamirpt">

</div>

<div class="tracked_mods" id="earthquakerpt">

</div>

<div class="tracked_mods" id="ad-lrec">

<!-- SpaceID=0 robot -->

<div id="boxLREC">

<iframe allowtransparency="true" frameborder="0" hidefocus="true" id="tgtLREC" scrolling="no" src="about:blank" style="display:none;" tabindex="-1"></iframe>

</div>

</div>

<div class="tracked_mods" id="ad-ysp">

<!-- SpaceID=0 robot -->

</div>

<div class="yjw_sub_md_lined">

<!-- 防災情報 -->

<div class="tracked_mods" id="disaster">

<dl class="yjw_navi yjw_clr yjSt disaster">

<dt>防災情報</dt>

<dd class="list-w33"><a href="//typhoon.yahoo.co.jp/weather/jp/warn/">警報・注意報</a></dd>

<dd class="list-w33"><a href="//typhoon.yahoo.co.jp/weather/rainstorm/">大雨警戒情報</a></dd>

<dd class="list-w33"><a href="//typhoon.yahoo.co.jp/weather/jp/typhoon/">台風</a></dd>

<dd class="list-w33"><a href="//typhoon.yahoo.co.jp/weather/river/">河川水位</a></dd>

<dd class="list-w20"><a href="//typhoon.yahoo.co.jp/weather/jp/earthquake/">地震</a></dd>

<dd class="list-w47"><a href="//typhoon.yahoo.co.jp/weather/jp/tsunami/">津波</a></dd>

<dd class="list-w20"><a href="//typhoon.yahoo.co.jp/weather/jp/volcano/">火山</a></dd>

<dd class="list-w53"><a href="https://blogs.yahoo.co.jp/FRONT/OFFICIAL/official_list.html?pt=4">自治体の防災情報</a></dd>

<dd class="list-w47"><a href="//crisis.yahoo.co.jp/evacuation/">避難情報</a></dd>

<dd class="list-w53"><a href="//crisis.yahoo.co.jp/shelter/map/">避難所マップ</a></dd>

<dd class="list-w47"><a href="//typhoon.yahoo.co.jp/weather/jp/emergency/">緊急・被害状況</a></dd>

<dd class="list-w47"><a href="https://emg.yahoo.co.jp/sokuho/column/top/">防災コラム</a></dd>

<dd class="list-w47"><a href="https://emg.yahoo.co.jp/">防災速報</a></dd>

<dd class="list-w47"><a href="//typhoon.yahoo.co.jp/weather/calendar/">災害カレンダー</a></dd>

</dl>

</div>

・
・
(omitted)
・
・
</body>

</html>

Running the sample code above produced the following warning.

/usr/lib/python2.7/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 11 of the file main.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))

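In short, the warning says that if you don't specify a parser explicitly, the HTML may be parsed with a different parser when the code runs in another environment, so you should always name the parser yourself.
The code below is the corrected version.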

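# -*- coding: utf-8 -*-
import urllib2

from bs4 import BeautifulSoup


def main():
    # Fetch the page (urllib2 is the Python 2 module).
    html = urllib2.urlopen('https://weather.yahoo.co.jp/weather/')
    # Name the parser explicitly so the result does not depend on the environment.
    print BeautifulSoup(html, "html.parser")


if __name__ == "__main__":
    main()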

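5. Extracting information from the fetched page

Now let's take the fetched HTML and collect every a tag whose href attribute value is a URL beginning with http.

5-1. Sample code

The sample passes a regular expression to find_all and searches for href values that contain http. BeautifulSoup applies the pattern with search(), so it matches http anywhere in the value, not only at the start.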

# -*- coding: utf-8 -*-
import re
import urllib2

from bs4 import BeautifulSoup


def main():
    html = urllib2.urlopen('https://weather.yahoo.co.jp/weather/')
    parsed_html = BeautifulSoup(html, "html.parser")
    # Find every <a> tag whose href contains "http".
    print parsed_html.find_all("a", href=re.compile("http"))


if __name__ == "__main__":
    main()
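To make the difference between "contains http" and "starts with http" concrete, here is a minimal sketch using only the re module. The first two href values appear in the page output shown earlier; the third is a hypothetical value made up for this illustration, and the anchored pattern is an assumption about what "starts with http" should mean, not something the original sample uses.

```python
import re

# Sample href values. The first two appear in the page output shown earlier;
# the last one is hypothetical and exists only for this illustration.
hrefs = [
    "https://emg.yahoo.co.jp/",
    "//typhoon.yahoo.co.jp/weather/jp/warn/",
    "/redirect?to=http://example.com/",  # hypothetical
]

contains_http = re.compile("http")            # pattern used in the sample code
starts_with_http = re.compile(r"^https?://")  # anchored alternative (assumption)

# search() looks for the pattern anywhere in the string,
# so only the anchored pattern rejects the hypothetical redirect link.
matched_anywhere = [h for h in hrefs if contains_http.search(h)]
matched_at_start = [h for h in hrefs if starts_with_http.search(h)]

print(matched_anywhere)
print(matched_at_start)
```

If only links that really begin with http or https are wanted, the anchored pattern could be passed to find_all in place of re.compile("http").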

5-2. Execution result

Here is the result of running the code.
The matching a tags come back as a list.

[<a href="https://rdsig.yahoo.co.jp/weather/ult/pc/video/main/RV=1/RE=1509508882/RH=cmRzaWcueWFob28uY28uanA-/RB=/RU=aHR0cHM6Ly93ZWF0aGVyLnlhaG9vLmNvLmpwL3dlYXRoZXIvdmlkZW8vP2M9NTA5NjEx/RS=^ADAJitnmsmPkVmtGjvI8lXkH6ZvkQY-">\n<img src="https://iwiz-yvpub.c.yimg.jp/im_siggnNoWqCjdFrsChGzD0CQVBA---x148-y83-prix-bd1-bdx148-bdy83-bdc000000/d/yvpub-bucket001-west/contents/yvpub-content-59a815c5daefe65216d006d0dc06d035/images/yvpubthum509611-6c99ea3c220501f9332185cf38b6dd18.jpg"/><span>10/18\uff08\u6c34\uff097\u6642\u3000\u53f0\u98a821\u53f7\u306f\u5f37\u3044\u52e2\u529b\u3067\u5317\u4e0a\u4e2d\u3000\u5929\u6c17\u4e0b\u308a\u5742\u3000\u897f\u304b\u3089\u96e8\u96f2\u5e83\u304c\u308b</span>\n<span class="videoPlay">\u518d\u751f\u3059\u308b</span></a>,
・
・
(omitted)
・
・
<a href="https://docs.yahoo.co.jp/docs/info/terms/chapter1.html#cf2nd">\u30d7\u30e9\u30a4\u30d0\u30b7\u30fc\u30dd\u30ea\u30b7\u30fc</a>, <a href="https://docs.yahoo.co.jp/docs/info/terms/">\u5229\u7528\u898f\u7d04</a>, <a href="https://feedback.ms.yahoo.co.jp/voc/weather-voc/input">\u3054\u610f\u898b\u30fb\u3054\u8981\u671b</a>, <a href="https://www.yahoo-help.jp/app/home/p/616/">\u30d8\u30eb\u30d7\u30fb\u304a\u554f\u3044\u5408\u308f\u305b</a>]
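Note that the page output shown earlier also contains protocol-relative links such as //typhoon.yahoo.co.jp/weather/jp/warn/. They have no http in them, so they are not part of this result. If you want to treat them as full URLs as well, one option is to resolve them against the page URL with urljoin from the standard library. The following is only a sketch: the helper name to_absolute and the BASE_URL constant are made up for this example.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

# URL of the page that was fetched in the sample code.
BASE_URL = "https://weather.yahoo.co.jp/weather/"


def to_absolute(href, base=BASE_URL):
    """Resolve href against base (hypothetical helper).

    Absolute URLs are returned unchanged; protocol-relative ones
    inherit the scheme of the base URL.
    """
    return urljoin(base, href)


protocol_relative = "//typhoon.yahoo.co.jp/weather/jp/warn/"
already_absolute = "https://emg.yahoo.co.jp/"

print(to_absolute(protocol_relative))
print(to_absolute(already_absolute))
```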

Let's try getting only the first one.

# -*- coding: utf-8 -*-
import re
import urllib2

from bs4 import BeautifulSoup


def main():
    html = urllib2.urlopen('https://weather.yahoo.co.jp/weather/')
    parsed_html = BeautifulSoup(html, "html.parser")
    link_list = parsed_html.find_all("a", href=re.compile("http"))
    # Print only the first matching <a> tag.
    print link_list[0]


if __name__ == "__main__":
    main()

When you run it, you can see that the first element of the result list was retrieved correctly.

<a href="https://rdsig.yahoo.co.jp/weather/ult/pc/video/main/RV=1/RE=1509509248/RH=cmRzaWcueWFob28uY28uanA-/RB=/RU=aHR0cHM6Ly93ZWF0aGVyLnlhaG9vLmNvLmpwL3dlYXRoZXIvdmlkZW8vP2M9NTA5NjEx/RS=^ADA6ca4YMwOyMUrN9Y_PYCi344z08Q-">

<img src="https://iwiz-yvpub.c.yimg.jp/im_siggnNoWqCjdFrsChGzD0CQVBA---x148-y83-prix-bd1-bdx148-bdy83-bdc000000/d/yvpub-bucket001-west/contents/yvpub-content-59a815c5daefe65216d006d0dc06d035/images/yvpubthum509611-6c99ea3c220501f9332185cf38b6dd18.jpg"/><span>10/18（水）7時　台風21号は強い勢力で北上中　天気下り坂　西から雨雲広がる</span>

<span class="videoPlay">再生する</span></a>
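The result above is the whole a element, not just the URL. With BeautifulSoup, a tag's attributes can be read like a dictionary, for example link_list[0]["href"] or link_list[0].get("href"). To show the same idea without any third-party dependency, the sketch below uses Python's built-in HTML parser instead of BeautifulSoup, so it is not the approach used in this article. The class name HrefCollector and the shortened sample_html string are made up for this example.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


# Shortened stand-in for the fetched page; the links are taken from
# the page output shown earlier, the link text is made up.
sample_html = (
    '<dl><dd><a href="//typhoon.yahoo.co.jp/weather/jp/warn/">warn</a></dd>'
    '<dd><a href="https://emg.yahoo.co.jp/">emg</a></dd></dl>'
)

collector = HrefCollector()
collector.feed(sample_html)
print(collector.hrefs)
```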

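6. References

By using BeautifulSoup, we were able to parse the HTML fetched with urllib2 and extract information from it.
If you want to learn more about BeautifulSoup, the URL of its documentation is listed below, so please check for yourself what else it can do.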

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

That's all.

Sassy Blog

A blog about enriching your life by doing what you love

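Table of contents

1. Execution environment
2. What is scraping?
3. Setting up a scraping environment
3-1. BeautifulSoup
3-2. urllib2
4. Fetching a page
4-1. Sample code
4-2. Execution result
5. Extracting information from the fetched page
5-1. Sample code
5-2. Execution result
6. References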