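{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "\n", "# Chapter 3: Data Scraping\n", "\n", "An Introduction to Requests and Beautiful Soup\n", "\n", "\n", "![image.png](./images/author.png)\n" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-06-08T14:27:07.092873Z", "start_time": "2019-06-08T14:27:07.089169Z" }, "scrolled": true, "slideshow": { "slide_type": "slide" } }, "source": [ "## Basic Principles\n", "\n", "A web crawler is an automated program that requests web pages and extracts data from them. Requesting, extracting, and automating are the three key ideas. The basic workflow of a crawler:\n", "\n", "- Send a request\n", "    - Use an HTTP library to send a Request to the target site. The request can carry extra information such as headers. Then wait for the server to respond.\n", "\n", "- Get the response content\n", "    - If the server responds normally, you receive a Response. Its content is the page content you want, which may be HTML, a JSON string, or binary data (such as images or video).\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- Parse the content\n", "    - If the content is HTML, parse it with an HTML parsing library or regular expressions; if it is JSON, convert it directly into a JSON object; if it is binary data, save it or process it further.\n", "\n", "- Save the data\n", "    - The data can be saved in many forms: as text, in a database, or as a file in a specific format.\n", "\n", "The browser sends a message to the server that hosts the URL; this step is called the **HTTP Request**. After receiving the message, the server processes it according to its content and sends a message back to the browser; this step is the **HTTP Response**." 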
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Problems to Solve \n", "\n", "- Parsing pages\n", "- Retrieving source data hidden behind JavaScript\n", "- Automatic pagination\n", "- Automatic login\n", "- Connecting to APIs\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For ordinary data scraping, requests combined with Beautiful Soup is enough.\n", "- This is especially true for pages whose URLs change in a regular pattern from one page to the next; you only need to generate the patterned URLs.\n", "- A simple example is scraping posts about a given keyword from the Tianya forum.\n", "    - On the Tianya forum, the first page of posts about smog (雾霾) is:\n", "http://bbs.tianya.cn/list.jsp?item=free&nextid=0&order=8&k=雾霾\n", "    - The second page is:\n", "http://bbs.tianya.cn/list.jsp?item=free&nextid=1&order=8&k=雾霾\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## A First Crawler\n", "\n", "![](images/alice.png)\n", "\n", "Beautiful Soup Quick Start \n", "\n", "http://www.crummy.com/software/BeautifulSoup/bs4/doc/\n", "\n", "\n", "http://computational-class.github.io/bigdata/data/test.html\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "'Once upon a time there were three little sisters,' the Dormouse began in a great hurry; 'and their names were Elsie, Lacie, and Tillie; and they lived at the bottom of a well--'\n", "\n", "\n", "\n", "'What did they live on?' 
said Alice, who always took a great interest in questions of eating and drinking.\n", "\n", "'They lived on treacle,' said the Dormouse, after thinking a minute or two.\n", "\n", "'They couldn't have done that, you know,' Alice gently remarked; 'they'd have been ill.'\n", "\n", "'So they were,' said the Dormouse; 'very ill.'\n", "\n", "**Alice's Adventures in Wonderland** CHAPTER VII A Mad Tea-Party http://www.gutenberg.org/files/928/928-h/928-h.htm" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T01:40:02.146927Z", "start_time": "2021-05-15T01:40:01.932107Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup " ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2020-11-03T03:32:12.203081Z", "start_time": "2020-11-03T03:32:09.078050Z" }, "slideshow": { "slide_type": "skip" } }, "source": [ "```\n", "import requests\n", "from bs4 import BeautifulSoup \n", "\n", "url = 'https://vp.fact.qq.com/home'\n", "content = requests.get(url)\n", "soup = BeautifulSoup(content.text, 'html.parser') \n", "\n", "```\n" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "ExecuteTime": { "end_time": "2020-11-03T03:49:07.016890Z", "start_time": "2020-11-03T03:49:07.013641Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on function get in module requests.api:\n", "\n", "get(url, params=None, **kwargs)\n", "    Sends a GET request.\n", "    \n", "    :param url: URL for the new :class:`Request` object.\n", "    :param params: (optional) Dictionary, list of tuples or bytes to send\n", "        in the query string for the :class:`Request`.\n", "    :param \\*\\*kwargs: Optional arguments that ``request`` takes.\n", "    :return: :class:`Response <Response>` object\n", "    :rtype: requests.Response\n", "\n" ] } ], "source": [ "help(requests.get)" ] }, { "cell_type": "code", 
"execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T01:40:30.092870Z", "start_time": "2021-05-15T01:40:27.764670Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "url = 'https://socratesacademy.github.io/bigdata/data/test.html'\n", "content = requests.get(url)\n", "#help(content)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T01:45:19.650895Z", "start_time": "2021-05-15T01:45:19.647868Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Dormouse's story\n", "\n", "

The Dormouse's story

\n", "\n", "

Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.

\n", "\n", "

...

\n" ] } ], "source": [ "print(content.text)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T01:45:54.229935Z", "start_time": "2021-05-15T01:45:54.221219Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'utf-8'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "content.encoding" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Beautiful Soup\n", "> Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:\n", "\n", "- Beautiful Soup provides a few simple methods. It doesn't take much code to write an application.\n", "- Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't specify one and Beautiful Soup can't detect it; then you just have to specify the original encoding.\n", "- Beautiful Soup sits on top of popular Python parsers like `lxml` and `html5lib`.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Install beautifulsoup4\n", "\n", "Open your terminal or command prompt and run:\n", "\n", "    $ pip install beautifulsoup4" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### html.parser\n", "Beautiful Soup supports the `html.parser` included in Python’s standard library.\n", "\n", "### lxml\n", "Beautiful Soup also supports a number of third-party Python parsers. One is the `lxml` parser. Depending on your setup, you might install lxml with one of these commands:\n", "\n", "> $ apt-get install python-lxml\n", "\n", "> $ easy_install lxml\n", "\n", "> $ pip install lxml" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### html5lib\n", "Another alternative is the pure-Python `html5lib` parser, which parses HTML the way a web browser does. 
Depending on your setup, you might install html5lib with one of these commands:\n", "\n", "> $ apt-get install python-html5lib\n", "\n", "> $ easy_install html5lib\n", "\n", "> $ pip install html5lib" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T01:48:48.813482Z", "start_time": "2021-05-15T01:48:47.222732Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "The Dormouse's story\n", "\n", "

The Dormouse's story

\n", "

Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.

\n", "

...

" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = 'http://socratesacademy.github.io/bigdata/data/test.html'\n", "content = requests.get(url)\n", "content = content.text\n", "soup = BeautifulSoup(content, 'html.parser') \n", "soup" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T01:48:59.451986Z", "start_time": "2021-05-15T01:48:59.448334Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", " \n", " The Dormouse's story\n", " \n", " \n", " \n", "

\n", " \n", " The Dormouse's story\n", " \n", "

\n", "

\n", " Once upon a time there were three little sisters; and their names were\n", " \n", " Elsie\n", " \n", " ,\n", " \n", " Lacie\n", " \n", " and\n", " \n", " Tillie\n", " \n", " ;\n", "and they lived at the bottom of a well.\n", "

\n", "

\n", " ...\n", "

\n", " \n", "\n" ] } ], "source": [ "print(soup.prettify())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- html\n", "    - head\n", "        - title\n", "    - body\n", "        - p (class = 'title', 'story')\n", "            - a (class = 'sister')\n", "                - href/id" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The select method\n", "\n", "\n", "- Tag names are written without any prefix\n", "- Class names are prefixed with a dot (`.`)\n", "- id names are prefixed with `#`\n", "\n", "These are CSS selector conventions, and we can use them with soup.select() to filter elements. The return value is a list." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Three steps for using select\n", "\n", "- Inspect\n", "- Copy\n", "    - Copy Selector\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Select the title `The Dormouse's story` with the mouse, then right-click and choose Inspect\n", "- Move the mouse to the highlighted source code\n", "- Right-click and choose Copy --> Copy Selector \n", "\n", "`body > p.title > b`\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:02:06.793459Z", "start_time": "2021-05-15T02:02:06.789378Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "\"The Dormouse's story\"" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('body > p.title > b')[0].text" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### select: find by tag name" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:03:52.198927Z", "start_time": "2021-05-15T02:03:52.194697Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "\"The Dormouse's story\"" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('title')[0].text" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": 
"2021-05-15T02:04:14.141466Z", "start_time": "2021-05-15T02:04:14.137294Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[Elsie,\n", " Lacie,\n", " Tillie]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('a')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:04:23.787514Z", "start_time": "2021-05-15T02:04:23.783236Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[The Dormouse's story]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('b')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### select: find by class name" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:04:44.844325Z", "start_time": "2021-05-15T02:04:44.840200Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[

The Dormouse's story

]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('.title')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:04:52.867866Z", "start_time": "2021-05-15T02:04:52.863451Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[Elsie,\n", " Lacie,\n", " Tillie]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('.sister')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:05:47.218607Z", "start_time": "2021-05-15T02:05:47.214047Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[

Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.

,\n", "

...

]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('.story')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### select: find by id" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:06:00.122987Z", "start_time": "2021-05-15T02:06:00.118356Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[Elsie]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('#link1')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:06:34.890111Z", "start_time": "2021-05-15T02:06:34.886086Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'http://example.com/elsie'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('#link1')[0]['href']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### select: combined lookup\n", "\n", "Combine tag names, class names, and id names\n", "\n", "- For example, find the element with id `link1` inside a `p` tag\n", " " ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:07:24.429115Z", "start_time": "2021-05-15T02:07:24.425148Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[Elsie]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select('p #link1')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### select: direct-child lookup\n", "\n", "Restrict matches to direct children\n", "- The greater-than sign `>` is the CSS child combinator: `head > title` matches a `title` that is a direct child of `head`, while a space (as in `p #link1`) matches any descendant.\n", "- An attribute filter such as `a[href]` belongs to the same node as the tag, so no space may be placed between them.\n", " \n", "\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:07:56.662539Z", "start_time": "2021-05-15T02:07:56.658377Z" 
}, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[The Dormouse's story]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select(\"head > title\")" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:08:03.998680Z", "start_time": "2021-05-15T02:08:03.994092Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[

The Dormouse's story

,\n", "

Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.

,\n", "

...

]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.select(\"body > p\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The find_all method" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:08:47.621921Z", "start_time": "2021-05-15T02:08:47.617950Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[

The Dormouse's story

,\n", "

Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.

,\n", "

...

]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#soup('p')\n", "soup.find_all('p')" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2020-06-06T02:15:21.397409Z", "start_time": "2020-06-06T02:15:21.369088Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[

The Dormouse's story

,\n", "

Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.

,\n", "

...

]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('p') " ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:09:09.820472Z", "start_time": "2021-05-15T02:09:09.816375Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[\"The Dormouse's story\",\n", " 'Once upon a time there were three little sisters; and their names were\\nElsie,\\nLacie and\\nTillie;\\nand they lived at the bottom of a well.',\n", " '...']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[i.text for i in soup('p')]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:09:36.730551Z", "start_time": "2021-05-15T02:09:36.727221Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Dormouse's story\n", "Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.\n", "...\n" ] } ], "source": [ "for i in soup('p'):\n", " print(i.text)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:09:51.874753Z", "start_time": "2021-05-15T02:09:51.870515Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "html\n", "head\n", "title\n", "body\n", "p\n", "b\n", "p\n", "a\n", "a\n", "a\n", "p\n" ] } ], "source": [ "for tag in soup.find_all(True):\n", " print(tag.name)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:10:01.561490Z", "start_time": "2021-05-15T02:10:01.557739Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[The Dormouse's story]" ] }, "execution_count": 27, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "soup('head') # or soup.head" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:10:09.650624Z", "start_time": "2021-05-15T02:10:09.646618Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[\n", "

The Dormouse's story

\n", "

Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.

\n", "

...

]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup('body') # or soup.body" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:10:13.554405Z", "start_time": "2021-05-15T02:10:13.550566Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[The Dormouse's story]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup('title') # or soup.title" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:10:20.361370Z", "start_time": "2021-05-15T02:10:20.357523Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[

The Dormouse's story

,\n", "

Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.

,\n", "

...

]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup('p')" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "

The Dormouse's story

" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.p" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:10:46.279073Z", "start_time": "2021-05-15T02:10:46.275793Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'title'" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title.name" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:10:53.369269Z", "start_time": "2021-05-15T02:10:53.365521Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "\"The Dormouse's story\"" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title.string" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:10:59.969036Z", "start_time": "2021-05-15T02:10:59.965370Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "\"The Dormouse's story\"" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title.text\n", "# Prefer the .text attribute" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2020-06-06T02:18:16.349550Z", "start_time": "2020-06-06T02:18:16.340669Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'head'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.title.parent.name" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:11:21.857429Z", "start_time": "2021-05-15T02:11:21.853450Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "

The Dormouse's story

" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.p" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:11:32.544277Z", "start_time": "2021-05-15T02:11:32.540376Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['title']" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.p['class']" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:13:12.189910Z", "start_time": "2021-05-15T02:13:12.185675Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[

Once upon a time there were three little sisters; and their names were\n", " Elsie,\n", " Lacie and\n", " Tillie;\n", " and they lived at the bottom of a well.

,\n", "

...

]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Pass attrs as a dict mapping attribute name to value\n", "soup.find_all('p', {'class': 'story'}) " ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:13:24.314037Z", "start_time": "2021-05-15T02:13:24.311706Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "#soup.find_all('p', class_= 'title')" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:13:32.002962Z", "start_time": "2021-05-15T02:13:31.998838Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[Elsie,\n", " Lacie,\n", " Tillie]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('a', {'class': 'sister'})" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2018-04-28T02:08:27.252239Z", "start_time": "2018-04-28T02:08:27.247016Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[Elsie,\n", " Lacie,\n", " Tillie]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('p', {'class': 'story'})[0].find_all('a')" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:14:06.047586Z", "start_time": "2021-05-15T02:14:06.043761Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "Elsie" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.a" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:14:10.817296Z", "start_time": "2021-05-15T02:14:10.813436Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[Elsie,\n", " Lacie,\n", " Tillie]" ] }, "execution_count": 46, "metadata": 
{}, "output_type": "execute_result" } ], "source": [ "soup('a')" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:14:38.104275Z", "start_time": "2021-05-15T02:14:38.100394Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "Elsie" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find(id=\"link1\")" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:14:41.664424Z", "start_time": "2021-05-15T02:14:41.660615Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[Elsie,\n", " Lacie,\n", " Tillie]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('a')" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:14:45.672192Z", "start_time": "2021-05-15T02:14:45.667941Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[Elsie,\n", " Lacie,\n", " Tillie]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('a', {'class': 'sister'}) # compare with soup.find_all('a')" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "ExecuteTime": { "end_time": "2021-05-15T02:14:48.888543Z", "start_time": "2021-05-15T02:14:48.884296Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "Elsie" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('a', {'class': 'sister'})[0]" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'Elsie'" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"soup.find_all('a', {'class': 'sister'})[0].text " ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'http://example.com/elsie'" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('a', {'class': 'sister'})[0]['href']" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'link1'" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all('a', {'class': 'sister'})[0]['id']" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "ExecuteTime": { "end_time": "2020-11-03T03:42:28.907584Z", "start_time": "2020-11-03T03:42:28.903704Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[The Dormouse's story,\n", " Elsie,\n", " Lacie,\n", " Tillie]" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all([\"a\", \"b\"])" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "ExecuteTime": { "end_time": "2020-11-03T03:43:23.483217Z", "start_time": "2020-11-03T03:43:23.480006Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Dormouse's story\n", "\n", "The Dormouse's story\n", "Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.\n", "...\n" ] } ], "source": [ "print(soup.get_text())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![image.png](./images/end.png)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { 
"name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "100px", "left": "1287.36px", "top": "0px", "width": "130.656px" }, "toc_section_display": false, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 1 }