Found 142 matching posts.
2020-04-22
Automatically generating and installing requirements.txt dependencies
Generate a requirements.txt file:
pip freeze > requirements.txt
Install the dependencies listed in requirements.txt:
pip install -r requirements.txt
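As a small add-on (not from the original post), the same two steps can be scripted. This is a minimal sketch that shells out to pip via subprocess; freeze_requirements and install_requirements are hypothetical helper names.

import subprocess
import sys

def freeze_requirements(path="requirements.txt"):
    # Capture the output of `pip freeze` and write it to requirements.txt
    frozen = subprocess.run([sys.executable, "-m", "pip", "freeze"],
                            capture_output=True, text=True, check=True).stdout
    with open(path, "w", encoding="utf-8") as f:
        f.write(frozen)

def install_requirements(path="requirements.txt"):
    # Install everything listed in requirements.txt
    subprocess.run([sys.executable, "-m", "pip", "install", "-r", path], check=True)

if __name__ == "__main__":
    freeze_requirements()
    install_requirements()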
April 22, 2020 · 714 reads · 0 comments · 0 likes
2020-04-21
Fetching Yiban article comment data with Python
import requests

url = 'https://www.yiban.cn/forum/reply/listAjax'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
}
data = {
    'channel_id': '289081',
    'puid': '13088902',
    'article_id': '121116137',
    'page': '1',
    'size': '200',
    'order': '1',
}
html = requests.post(headers=headers, data=data, url=url).json()
data = html['data']['list']

content = []
floor = []
createTime = []
name = []
nameid = []
nick = []

counts = len(data)
for i in range(counts - 1):
    commen = data[str(i)]
    con = commen['content']
    content.append(str(con).replace('\n', ''))  # keep each comment body on a single line
    floor.append(commen['floor'])
    createTime.append(commen['createTime'])
    name.append(commen['user']['name'])
    nameid.append(commen['user']['id'])
    nick.append(commen['user']['nick'])

with open('result.csv', 'a+', encoding='utf-8') as f:
    # Header columns: name } user id } nickname } floor } comment time } comment content
    f.write('姓名}用户id}昵称}楼层}评论时间}评论内容\n')
    for i in range(len(name)):
        # str() in case the API returns numeric fields
        f.write(name[i] + "}" + str(nameid[i]) + "}" + nick[i] + "}" + str(floor[i]) + "}"
                + str(createTime[i]) + "}" + content[i] + "\n")
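As a side note on the output step (not from the original post), the manual f.write concatenation could be replaced with Python's csv module. write_comments_csv below is a hypothetical helper; rows would be built from the lists collected above, e.g. zip(name, nameid, nick, floor, createTime, content).

import csv

def write_comments_csv(rows, path='result.csv'):
    # rows: an iterable of (name, user_id, nick, floor, create_time, content) tuples,
    # e.g. zip(name, nameid, nick, floor, createTime, content) from the script above.
    with open(path, 'a+', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='}')
        writer.writerow(['姓名', '用户id', '昵称', '楼层', '评论时间', '评论内容'])
        writer.writerows(rows)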
April 21, 2020 · 995 reads · 0 comments · 0 likes
2020-04-18
Scraping a proxy IP pool with Python
import parsel
import requests

url = ''  # proxy list page URL (left blank in the original post)
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
}
html = requests.get(url=url, headers=headers).text
html = parsel.Selector(html)

Ip = html.xpath('//td[@data-title="IP"]/text()').extract()
Port = html.xpath('//td[@data-title="PORT"]/text()').extract()
LeiXing = html.xpath('//td[@data-title="类型"]/text()').extract()  # 类型 = proxy type (HTTP/HTTPS)

result = []
for i in range(len(Ip)):
    a = (LeiXing[i] + '://' + Ip[i] + ':' + Port[i])
    pro = {LeiXing[i]: a}
    result.append(pro)

# Validate each proxy by requesting Baidu through it
for i in result:
    try:
        ssss = requests.get(url='http://www.baidu.com', headers=headers,
                            proxies=i, timeout=1).status_code
        if ssss == 200:
            print(i)
    except:
        print('不合格')  # proxy failed the check
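A brief usage sketch building on the validated proxies (hypothetical, not in the original post): keep the dictionaries that passed the check in a pool and rotate through them for later requests, falling back to a direct request if none work. fetch_with_proxy is an assumed helper name.

import random
import requests

def fetch_with_proxy(url, proxy_pool, headers=None, timeout=5):
    # proxy_pool: a list of proxy dicts in the same format as `result` above,
    # e.g. [{'HTTP': 'HTTP://1.2.3.4:8080'}, ...]
    for proxy in random.sample(proxy_pool, len(proxy_pool)):
        try:
            return requests.get(url, headers=headers, proxies=proxy, timeout=timeout)
        except requests.RequestException:
            continue
    # Fall back to a direct request if every proxy failed
    return requests.get(url, headers=headers, timeout=timeout)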
April 18, 2020 · 769 reads · 0 comments · 0 likes
2020-04-12
Python 3 cheat sheet
PDF download: click through (link in the post). Project page: https://perso.limsi.fr/pointal/python:memento
April 12, 2020 · 792 reads · 0 comments · 0 likes
2020-04-12
PyCharm shows "Connection reset" when connecting to GitHub
Turning on automatic proxy detection fixes it.
April 12, 2020 · 1,127 reads · 0 comments · 0 likes
2020-04-10
A script to auto-quote request header lines for Python crawlers
import re

# Raw header lines copied from the browser's developer tools
headers_str = '''
formhash: f0f241b5
qdxq: nu
qdmode: 2
todaysay: 
fastreply: 0
'''

# Wrap each "key: value" line in quotes so it can be pasted into a dict literal
pattern = '^(.*?): (.*)$'
for line in headers_str.splitlines():
    print(re.sub(pattern, '\'\\1\': \'\\2\',', line))
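A variant of the same idea (hypothetical, not from the original post): instead of printing quoted lines to paste back by hand, parse the copied header block straight into a dict. parse_headers is an assumed helper name.

def parse_headers(raw: str) -> dict:
    # Turn each non-empty "key: value" line into a dict entry.
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(':')
        headers[key.strip()] = value.strip()
    return headers

# Example: parse_headers(headers_str) would give
# {'formhash': 'f0f241b5', 'qdxq': 'nu', 'qdmode': '2', 'todaysay': '', 'fastreply': '0'}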
April 10, 2020 · 682 reads · 0 comments · 0 likes
2020-03-27
[Crawler] Scraping all P2P download links from the MSDN site with Python
Today the new MSDN site opened registration. After trying it out, I found that you are forced to watch a 30-second ad before you can download anything, so I decided to crawl the resources in advance for later use. Let's look at the result first:

1. Site analysis

1.1 Crawling https://msdn.itellyou.cn/ directly yields 8 IDs, corresponding to the eight categories in the sidebar.

1.2 Each time a category is expanded, a POST request is sent, and what it carries is one of the 8 IDs obtained earlier.

1.3 Inspecting the response of that request, we get yet another ID together with the corresponding resource name.

1.4 Clicking to expand a resource reveals two more POST requests.

1.4.1 The first, GetLang, appears to fetch the languages available for the resource. This request also sends an ID, and its response contains another ID, which is the lang value used later on.

1.4.2 The second, GetList, passes three parameters: (1) ID: comparison shows it is the same ID we have been using all along. (2) lang: I later realized this is short for language; its value is taken from the response of the GetLang request. (3) filter: whether the checkbox in the red box at the lower-left of the screenshot is ticked.

1.4.3 At this point the download address is already present in the response. That completes the analysis, so it is time to write some code.

2. For speed, I chose the Scrapy framework. The code should speak for itself.

The spider (爬虫.py):

# -*- coding: utf-8 -*-
import json

import scrapy

from msdn.items import MsdnItem


class MsdndownSpider(scrapy.Spider):
    name = 'msdndown'
    allowed_domains = ['msdn.itellyou.cn']
    start_urls = ['http://msdn.itellyou.cn/']

    def parse(self, response):
        self.index = [i for i in response.xpath('//h4[@class="panel-title"]/a/@data-menuid').extract()]
        # self.index_title = [i for i in response.xpath('//h4[@class="panel-title"]/a/text()').extract()]
        url = 'https://msdn.itellyou.cn/Category/Index'
        for i in self.index:
            yield scrapy.FormRequest(url=url, formdata={'id': i}, dont_filter=True,
                                     callback=self.Get_Lang, meta={'id': i})

    def Get_Lang(self, response):
        id_info = json.loads(response.text)
        url = 'https://msdn.itellyou.cn/Category/GetLang'
        for i in id_info:  # iterate over the software list
            lang = i['id']  # software ID
            title = i['name']  # software name
            # Next request: fetch the list of language IDs for this software
            yield scrapy.FormRequest(url=url, formdata={'id': lang}, dont_filter=True,
                                     callback=self.Get_List, meta={'id': lang, 'title': title})

    def Get_List(self, response):
        lang = json.loads(response.text)['result']
        id = response.meta['id']
        title = response.meta['title']
        url = 'https://msdn.itellyou.cn/Category/GetList'
        # Skip if the language list is empty, otherwise request the download links
        if len(lang) != 0:
            # Iterate over the language ID list
            for i in lang:
                data = {
                    'id': id,
                    'lang': i['id'],
                    'filter': 'true'
                }
                yield scrapy.FormRequest(url=url, formdata=data, dont_filter=True,
                                         callback=self.Get_Down,
                                         meta={'name': title, 'lang': i['lang']})
        else:
            pass

    def Get_Down(self, response):
        response_json = json.loads(response.text)['result']
        item = MsdnItem()
        for i in response_json:
            item['name'] = i['name']
            item['url'] = i['url']
            print(i['name'] + "--------------" + i['url'])  # debug output so the run is less boring
        return item

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MsdnItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    url = scrapy.Field()

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for msdn project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'msdn'

SPIDER_MODULES = ['msdn.spiders']
NEWSPIDER_MODULE = 'msdn.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'msdn (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.1
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     'msdn.middlewares.MsdnSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'msdn.middlewares.MsdnDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'msdn.pipelines.MsdnPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class MsdnPipeline(object):
    def __init__(self):
        self.file = open('msdnc.csv', 'a+', encoding='utf8')

    def process_item(self, item, spider):
        title = item['name']
        url = item['url']
        self.file.write(title + '*' + url + '\n')
        return item

    def close_spider(self, spider):  # was down_item in the original; renamed so Scrapy closes the file when the spider finishes
        self.file.close()

main.py (launcher):

from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'msdndown'])

3. Packaged download: click through to CSDN. Password: lan666 | Size: 60 KB. It has been scanned by security software and found virus-free, so download with confidence.
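For readers who want to follow the section-1 API flow without Scrapy, here is a minimal sketch using requests and parsel (hypothetical, not part of the original post; it assumes the same Category/Index, Category/GetLang and Category/GetList endpoints, form fields and JSON keys that the spider above uses).

import requests
from parsel import Selector

BASE = 'https://msdn.itellyou.cn'
session = requests.Session()

# Step 0: the landing page exposes the category IDs as data-menuid attributes.
home = session.get(BASE + '/').text
menu_ids = Selector(home).xpath('//h4[@class="panel-title"]/a/@data-menuid').extract()

for menu_id in menu_ids:
    # Step 1: each category ID returns the list of resources (id + name).
    resources = session.post(BASE + '/Category/Index', data={'id': menu_id}).json()
    for res in resources:
        # Step 2: GetLang returns the language IDs available for this resource.
        langs = session.post(BASE + '/Category/GetLang', data={'id': res['id']}).json()['result']
        for lang in langs:
            # Step 3: GetList returns the actual download links.
            listing = session.post(BASE + '/Category/GetList',
                                   data={'id': res['id'], 'lang': lang['id'], 'filter': 'true'}).json()['result']
            for entry in listing:
                print(entry['name'], entry['url'])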
March 27, 2020 · 1,699 reads · 2 comments · 0 likes