Articles in the Python category - Lan小站 - Well, not bad!

Lan

618 articles written in total
629 comments received in total

143 articles found

2020-03-27
[Crawler] Scraping every P2P download link from the MSDN site with python

Today the new msdn site opened for registration. After trying it out, I found that you are forced to watch a 30-second ad before you can download anything, so I decided to crawl the resources in advance for later use.

1. Site analysis

1.1 Requesting https://msdn.itellyou.cn/ directly yields 8 IDs, one for each of the eight categories in the sidebar.

1.2 Each time a category is expanded, the page sends a POST request carrying one of those 8 IDs.

1.3 The response to this request contains another ID together with the corresponding resource name.

1.4 Clicking to expand a resource triggers two more POST requests.

1.4.1 The first, GetLang, fetches the languages available for the resource. It also sends an ID, and its response contains yet another ID: the lang value used below.

1.4.2 The second, GetList, takes three parameters:
(1) id: comparison shows this is the same resource ID we have been using all along.
(2) lang: short for "language"; its value comes from the response of the GetLang request.
(3) filter: whether the checkbox in the red box at the bottom-left of the screenshot is ticked.

1.4.3 At this point the response already contains the download addresses.

That is the whole analysis. Now for the code.

2. For speed, I chose the Scrapy framework. The code follows.

爬虫.py (the spider):

```python
# -*- coding: utf-8 -*-
import json

import scrapy

from msdn.items import MsdnItem


class MsdndownSpider(scrapy.Spider):
    name = 'msdndown'
    allowed_domains = ['msdn.itellyou.cn']
    start_urls = ['http://msdn.itellyou.cn/']

    def parse(self, response):
        # The 8 category IDs from the sidebar
        self.index = response.xpath('//h4[@class="panel-title"]/a/@data-menuid').extract()
        # self.index_title = response.xpath('//h4[@class="panel-title"]/a/text()').extract()
        url = 'https://msdn.itellyou.cn/Category/Index'
        for i in self.index:
            yield scrapy.FormRequest(url=url, formdata={'id': i}, dont_filter=True,
                                     callback=self.Get_Lang, meta={'id': i})

    def Get_Lang(self, response):
        id_info = json.loads(response.text)
        url = 'https://msdn.itellyou.cn/Category/GetLang'
        for i in id_info:  # iterate over the software list
            lang = i['id']  # software ID
            title = i['name']  # software name
            # Next request: use the software ID to fetch its list of language IDs
            yield scrapy.FormRequest(url=url, formdata={'id': lang}, dont_filter=True,
                                     callback=self.Get_List, meta={'id': lang, 'title': title})

    def Get_List(self, response):
        lang = json.loads(response.text)['result']
        id = response.meta['id']
        title = response.meta['title']
        url = 'https://msdn.itellyou.cn/Category/GetList'
        # An empty language list is skipped; otherwise request the download addresses
        for i in lang:  # iterate over the language IDs
            data = {
                'id': id,
                'lang': i['id'],
                'filter': 'true'
            }
            yield scrapy.FormRequest(url=url, formdata=data, dont_filter=True,
                                     callback=self.Get_Down, meta={'name': title, 'lang': i['lang']})

    def Get_Down(self, response):
        response_json = json.loads(response.text)['result']
        for i in response_json:
            # Build a fresh item per entry and yield it, so every entry reaches the pipeline
            item = MsdnItem()
            item['name'] = i['name']
            item['url'] = i['url']
            print(i['name'] + "--------------" + i['url'])  # test output, to make the run less boring
            yield item
```

items.py:

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MsdnItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    url = scrapy.Field()
```

settings.py:

```python
# -*- coding: utf-8 -*-

# Scrapy settings for msdn project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'msdn'

SPIDER_MODULES = ['msdn.spiders']
NEWSPIDER_MODULE = 'msdn.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'msdn (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.1
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     'msdn.middlewares.MsdnSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'msdn.middlewares.MsdnDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'msdn.pipelines.MsdnPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```

pipelines.py:

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class MsdnPipeline(object):
    def __init__(self):
        self.file = open('msdnc.csv', 'a+', encoding='utf8')

    def process_item(self, item, spider):
        title = item['name']
        url = item['url']
        # One "name*url" record per line
        self.file.write(title + '*' + url + '\n')
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Scrapy calls close_spider when the spider finishes
        self.file.close()
```

main.py (launcher):

```python
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'msdndown'])
```

3. Packaged download

Click to enter: csdn. Password: lan666 | Size: 60 KB. The package has been scanned by security software and found clean, so feel free to download it.
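The request chain from the analysis (Category/Index, then GetLang, then GetList) can also be sketched without Scrapy. This is a minimal sketch, not the author's code: `collect_links`, `fake_post`, and the injected `post` callable are hypothetical names, the sample data is invented, and the JSON shapes are assumptions read off the spider's field accesses. Passing `post` in keeps the sketch free of network access.

```python
BASE = "https://msdn.itellyou.cn/Category/"


def collect_links(category_ids, post):
    """Walk Index -> GetLang -> GetList and return (name, url) pairs.

    `post(url, form)` is a hypothetical callable that sends a form-encoded
    POST request and returns the decoded JSON response.
    """
    links = []
    for category_id in category_ids:
        # Index: assumed to return a list of {"id": ..., "name": ...} resources
        for resource in post(BASE + "Index", {"id": category_id}):
            # GetLang: assumed to return {"result": [{"id": ..., "lang": ...}]}
            langs = post(BASE + "GetLang", {"id": resource["id"]})["result"]
            for lang in langs:
                form = {"id": resource["id"], "lang": lang["id"], "filter": "true"}
                # GetList: assumed to return {"result": [{"name": ..., "url": ...}]}
                for entry in post(BASE + "GetList", form)["result"]:
                    links.append((entry["name"], entry["url"]))
    return links


def fake_post(url, form):
    """Canned responses standing in for the real site (illustrative data only)."""
    if url.endswith("Index"):
        return [{"id": "res-1", "name": "Example OS"}]
    if url.endswith("GetLang"):
        return {"result": [{"id": "lang-1", "lang": "English"}]}
    return {"result": [{"name": "example.iso", "url": "ed2k://example"}]}


print(collect_links(["cat-1"], fake_post))
```

An empty language list simply produces no GetList requests, which matches the "skip if empty" branch in the spider.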
- 2020-03-27
- 1,917 views
- 2 comments
- 0 likes
2020-03-22
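How to read several space-separated values on one line in python3

```python
a, b = map(int, input().split())
```

For more variables, just add them on the left:

```python
a, b, c = map(int, input().split())
```

For comma-separated input:

```python
a, b, c = map(int, input().split(','))
```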
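The same idiom can be wrapped in a small helper so it can be tried without typing any input. A minimal sketch; `parse_ints` is a hypothetical name, not something from the post.

```python
def parse_ints(line, sep=None):
    """Split `line` on `sep` (any whitespace when sep is None) and convert each piece to int."""
    return list(map(int, line.split(sep)))


a, b = parse_ints("3 4")
x, y, z = parse_ints("1,2,3", sep=",")
print(a, b, x, y, z)
```

Unpacking raises `ValueError` when the number of values on the line does not match the number of variables, which is also true of the one-liners above.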
- 2020-03-22
- 886 views
- 0 comments
- 0 likes
2020-03-17
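python list comprehensions

A list comprehension is a shortcut for creating a list:

```python
squares = [i**2 for i in range(1, 11)]  # quickly create a list holding the squares of 1 through 10
```

When using range(), if the output is not what you expect, try adding one to or subtracting one from the value you specified. This is the off-by-one behavior you often see in programming languages.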
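To see the off-by-one effect concretely: `range(start, stop)` includes `start` but excludes `stop`. A small sketch with illustrative variable names:

```python
# range(1, 11) yields 1..10, because the stop value itself is excluded
squares = [i**2 for i in range(1, 11)]

# Off by one: range(1, 10) stops at 9, so the square of 10 is missing
too_short = [i**2 for i in range(1, 10)]

print(len(squares), squares[-1])
print(len(too_short), too_short[-1])
```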
- 2020-03-17
- 862 views
- 0 comments
- 0 likes
2020-03-15
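python: installing with pip from a mirror in China, and changing the source in pycharm

1. Change the source

```
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
```

2. Upgrade pip

```
python -m pip install --upgrade pip
```

Online installation

```
pip install module_name
```

If the network is poor, you can use a mirror in China:

```
pip install xx -i http://xxx
```

Some commonly used mirrors in China:
Douban: https://pypi.douban.com/simple
University of Science and Technology of China: https://mirrors.ustc.edu.cn/pypi/web/simple/
Tsinghua University: https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple/

Offline installation

Download the archive -> extract it -> open a terminal in the extracted directory and run:

```
python setup.py install
```

Installing a .whl file:

```
pip install xxx.whl
```

Changing the source in Pycharm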
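For the pip commands above, the `-i` mirror option can also be assembled from a script. A minimal sketch, assuming the Tsinghua index URL from step 1; `pip_install_command` is a hypothetical helper, and the command is only built here, not executed.

```python
import sys

TSINGHUA_INDEX = "https://pypi.tuna.tsinghua.edu.cn/simple"


def pip_install_command(package, index_url=TSINGHUA_INDEX):
    """Build the argument list for `python -m pip install <package> -i <index_url>`."""
    return [sys.executable, "-m", "pip", "install", package, "-i", index_url]


cmd = pip_install_command("requests")
print(" ".join(cmd))
# To actually install, pass the list to subprocess.run(cmd, check=True)
```

Using `sys.executable -m pip` rather than a bare `pip` targets the interpreter that is running the script, which helps when several Python installations coexist.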
- 2020-03-15
- 1,366 views
- 0 comments
- 0 likes
2020-03-15
Working with configuration files in Python

The `configparser` module in the Python standard library (named `ConfigParser` in Python 2) provides a complete API for reading and manipulating configuration files.

File format

A configuration file contains one or more sections, and each section has its own options.
A section is written as [sect_name]. Each option is a key-value pair separated by = or :.
Whitespace on either side of the option separator is ignored.
Comments in a configuration file start with #.

Sample configuration file dbconf.cfg:

```
[dbconfig]
# connection info for the read database
host=127.0.0.1
user=root
passwd=root
database=banma_finance
port=3306
```

Sample configuration file book.info:

```
[book]
# title
title: Core Python
author: Jack
version: 2016009021

[hardcopy]
pages:350
```

Working with the configuration file

Reading a configuration file

1 Instantiate ConfigParser

```python
import configparser

# Instantiate ConfigParser and load the configuration files
# Parser for dbconf
db_config_parser = configparser.ConfigParser()
db_config_parser.read('dbconf.cfg')
# Parser for book_info
book_info_parser = configparser.ConfigParser()
book_info_parser.read('book.info')
```

2 Read the section information

```python
# Get the section names
print(db_config_parser.sections())
print(book_info_parser.sections())
# Print the book title in upper case
print(book_info_parser.get("book", "title").upper())
print("by", book_info_parser.get("book", "author"))
# Formatted output of the settings in dbconf
for section in db_config_parser.sections():
    print(section)
    for option in db_config_parser.options(section):
        print(" ", option, "=", db_config_parser.get(section, option))
```

Output:

```
['dbconfig']
['book', 'hardcopy']
CORE PYTHON
by Jack
dbconfig
  host = 127.0.0.1
  user = root
  passwd = root
  database = banma_finance
  port = 3306
```

Writing a configuration file

Writing works much like reading: first operate on the relevant section, then write the options under that section.

```python
#!/usr/bin/python
# coding:utf-8
import configparser
import sys

# Initialize ConfigParser
config_writer = configparser.ConfigParser()
# Add the book section
config_writer.add_section("book")
# Add the title and author options to the book section
config_writer.set("book", "title", "Python: The Hard Way")
config_writer.set("book", "author", "anon")
# Add the ematter section and its pages option
config_writer.add_section("ematter")
config_writer.set("ematter", "pages", "250")  # option values must be strings
# Write the configuration to standard output
config_writer.write(sys.stdout)
# Write the configuration to a file
with open('new_book.info', 'w') as f:
    config_writer.write(f)
```

Output:

```
[book]
title = Python: The Hard Way
author = anon

[ematter]
pages = 250
```

Updating a configuration file

Updating a configuration file is a combination of reading and writing. Without a final write, none of the changes made through the parser actually reach the configuration file.

```python
#!/usr/bin/python
# coding:utf-8
import configparser
import sys

# Initialize ConfigParser
update_config_parser = configparser.ConfigParser()
update_config_parser.read('new_book.info')
print("sections:", update_config_parser.sections())
# Update the author name
print("original author:", update_config_parser.get("book", "author"))
# Change the author name to Jack
update_config_parser.set("book", "author", "Jack")
print("author after the change:", update_config_parser.get("book", "author"))
# Remove the ematter section if it exists
if update_config_parser.has_section("ematter"):
    update_config_parser.remove_section("ematter")
# Print the result
update_config_parser.write(sys.stdout)
# Overwrite the original configuration file
with open('new_book.info', 'w') as f:
    update_config_parser.write(f)
```
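The read, update, write cycle can be tried without touching the disk. A minimal Python 3 sketch using `configparser` with `read_string` and an in-memory buffer; the sample text mirrors new_book.info from the post, and the variable names are illustrative.

```python
import configparser
import io

SAMPLE = """\
[book]
title = Python: The Hard Way
author = anon

[ematter]
pages = 250
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# Update one option and drop one section, all in memory
parser.set("book", "author", "Jack")
if parser.has_section("ematter"):
    parser.remove_section("ematter")

# Nothing is persisted until write() is called on a file-like object
buffer = io.StringIO()
parser.write(buffer)
print(buffer.getvalue())
```

Swapping the `io.StringIO` buffer for a file opened in `'w'` mode gives the on-disk update described in the post.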
- 2020-03-15
- 918 views
- 0 comments
- 0 likes
2020-03-15
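Common time formatting in python

The commonly used time functions are:

Get the current timestamp: time.time()
Get the time as a struct_time tuple: time.localtime(time.time())
Format a date (starting from the tuple form):
(1) time.asctime(time.localtime(time.time()))
(2) time.strftime(format[, t])
Parse a formatted string into a time tuple: time.strptime(string, format='%a %b %d %H:%M:%S %Y')
Delay execution: time.sleep(secs), in seconds

Example 1:

```python
# -*- coding:utf-8 -*-
import time

# Current time as a timestamp
print(time.time())
# As a struct_time tuple
print(time.localtime(time.time()))
# Simple readable form
print(time.asctime(time.localtime(time.time())))
# Format as 2016-03-20 11:45:39
print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
# Format as Sat Mar 28 22:24:24 2016
print(time.strftime("%a %b %d %H:%M:%S %Y", time.localtime()))
# Convert a formatted string to a timestamp
a = "Sat Mar 28 22:24:24 2016"
print(time.mktime(time.strptime(a, "%a %b %d %H:%M:%S %Y")))
```

Output:

```
1481036968.19
time.struct_time(tm_year=2016, tm_mon=12, tm_mday=6, tm_hour=23, tm_min=9, tm_sec=28, tm_wday=1, tm_yday=341, tm_isdst=0)
Tue Dec 06 23:09:28 2016
2016-12-06 23:09:28
Tue Dec 06 23:09:28 2016
1459175064.0
```

Example 2: compare a given time with the current time. Once the current time has passed it, run a script; otherwise wait half an hour and check again.

```python
# -*- coding:utf-8 -*-
# Check whether the current time has passed a given time
import os
import time


def Fuctime(s):
    # Zero-padded "%Y-%m-%d %H:%M:%S" strings sort chronologically,
    # so a plain string comparison is enough here
    return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) > s


while True:
    if Fuctime('2016-12-05 00:00:00'):
        # A simple way to run a script at a given path
        os.system("python ./../day_2/Prime.py ./../day_2/inti_prime.txt ./../day_2/res_prime.txt")
        break
    else:
        time.sleep(1800)
```

Date and time format directives in python:

%y two-digit year (00-99)
%Y four-digit year (0000-9999)
%m month (01-12)
%d day of the month (01-31)
%H hour on the 24-hour clock (00-23)
%I hour on the 12-hour clock (01-12)
%M minute (00-59)
%S second (00-59)
%a locale's abbreviated weekday name
%A locale's full weekday name
%b locale's abbreviated month name
%B locale's full month name
%c locale's date and time representation
%j day of the year (001-366)
%p locale's equivalent of A.M. or P.M.
%U week number of the year (00-53), with Sunday as the first day of the week
%w weekday (0-6), with Sunday as 0
%W week number of the year (00-53), with Monday as the first day of the week
%x locale's date representation
%X locale's time representation
%Z name of the current time zone
%% a literal % character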
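Comparing formatted strings works in example 2 only because the format is zero-padded and ordered from year down to second. A sketch of an alternative, assuming the same `%Y-%m-%d %H:%M:%S` format: parse the deadline with `time.strptime` and compare timestamps instead. `has_passed` and `FMT` are hypothetical names, not from the post.

```python
import time

FMT = "%Y-%m-%d %H:%M:%S"


def has_passed(deadline, now=None):
    """Return True if `now` is later than `deadline`.

    `deadline` is a local-time string in FMT; `now` is seconds since the
    epoch and defaults to the current time.
    """
    if now is None:
        now = time.time()
    deadline_ts = time.mktime(time.strptime(deadline, FMT))
    return now > deadline_ts


print(has_passed("2016-12-05 00:00:00"))

# An explicit `now` makes the check repeatable: March 2016 is before the deadline
march_2016 = time.mktime(time.strptime("2016-03-28 22:24:24", FMT))
print(has_passed("2016-12-05 00:00:00", now=march_2016))
```

Working with timestamps also keeps the comparison valid if the display format later changes to one that does not sort chronologically as text.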
- 2020-03-15
- 793 views
- 0 comments
- 0 likes
2020-03-15
Python: saving cookies to a local file for reuse

There is no immediate way to do so, but it's not hard to do.

You can get a CookieJar object from the session as session.cookies, and you can use pickle to store it in a file.

A full example:

```python
import requests, pickle

session = requests.session()
# Make some calls
with open('somefile', 'wb') as f:
    pickle.dump(session.cookies, f)
```

Loading is then:

```python
session = requests.session()  # or an existing session
with open('somefile', 'rb') as f:
    session.cookies.update(pickle.load(f))
```

The requests library uses the requests.cookies.RequestsCookieJar() subclass, which explicitly supports pickling and a dict-like API, and you can use the RequestsCookieJar.update() method to update an existing session cookie jar with those loaded from a pickle file.
- 2020-03-15
- 812 views
- 0 comments
- 0 likes