After:
【记录】分析xxxapp中数据来源和如何爬取-1
and
【记录】分析xxxapp中数据来源和如何爬取-2
the next step is to work out the logic for crawling the data.
This post lays out the approach.
Crawling logic:
First, request get_course_list for category=all:
https://childapi.xxx.com/course/get_course_list?sign=4e89d3ae91d6e5c2a05908be477e913d&uid=0&sort=new&level=all&nature_style=all&start=0&category_id=1&auth_token=0&ishow=0&nature_area=all&rows=10&nature_id=all&stamp=1536819997
Each request uses rows=10, and start increases by 10 each time: start=0, 10, 20, and so on.
Paging through this endpoint yields the full list of courses, each of which includes its course id.
In testing, there were more than 36,000 course ids.
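The offset stepping described above can be sketched as follows. This is a minimal sketch, not the app's actual client: the helper names `course_list_url` and `first_page_urls` are hypothetical, and how `sign` and `stamp` are generated is not covered in this post, so they are passed in as opaque values through `extra`.

```python
from urllib.parse import urlencode

COURSE_LIST_URL = "https://childapi.xxx.com/course/get_course_list"


def course_list_url(start, rows=10, extra=None):
    """Build the URL for one page of get_course_list.

    `extra` is a hypothetical hook for sign/stamp; their generation
    is an open question here and is not implemented.
    """
    params = {
        "uid": 0,
        "auth_token": 0,
        "sort": "new",
        "level": "all",
        "nature_style": "all",
        "nature_area": "all",
        "nature_id": "all",
        "category_id": 1,
        "ishow": 0,
    }
    params.update(extra or {})
    # start/rows go last so the paging part of the URL is easy to spot.
    params["start"] = start
    params["rows"] = rows
    return COURSE_LIST_URL + "?" + urlencode(params)


def first_page_urls(page_count, rows=10):
    """Return URLs for start=0, rows, 2*rows, ... (page_count pages)."""
    return [course_list_url(page * rows, rows) for page in range(page_count)]
```

For example, `first_page_urls(3)` gives the URLs for start=0, start=10, and start=20.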
Then fetch the details of each course:
https://childapi.xxx.com/course/detail_new?course_id=61288
and save the mp4 video, the mp3 audio, the srt subtitles, and the JSON detail information.
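As a rough illustration of picking the downloadable files out of a course detail, here is a minimal sketch. The field names `video`, `audio`, and `subtitle_en` are taken from the course_61288_info.json example later in this post; treating `subtitle_en` as the srt to save is an assumption, and `media_urls` is a hypothetical helper name.

```python
def media_urls(detail):
    """Return {kind: url} for the media fields that are present and non-empty.

    Assumption: `subtitle_en` holds the srt URL; the detail also has a
    `video_srt` field whose role is not confirmed here.
    """
    field_for_kind = {
        "video": "video",
        "audio": "audio",
        "subtitle": "subtitle_en",
    }
    urls = {}
    for kind, field in field_for_kind.items():
        value = str(detail.get(field) or "").strip()
        if value:
            urls[kind] = value
    return urls
```

Empty or missing fields are skipped, so a course without audio simply produces no `audio` entry.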
For each course, also fetch:
the users who have already dubbed it:
https://childapi.xxx.com/course/last_show_peoples?sign=da0a90e3c7ee2f7656cbae6575c3521d&stamp=1536895345&uid=0&auth_token=0&course_id=59901&start=0&rows=20
and:
the most-liked ranking:
https://childapi.xxx.com/StudyShow/course_show?sign=da0a90e3c7ee2f7656cbae6575c3521d&stamp=1536895345&uid=0&auth_token=0&course_id=59901&start=0&rows=20
Here start can keep advancing in multiples of 20 with rows=20 until the returned data is an empty list [].
the most-played ranking:
https://childapi.xxx.com/StudyShow/viewTop?sign=4ccb35d6f3b4cfecc81c9b02462c5604&stamp=1536895896&uid=0&auth_token=0&course_id=59901&start=0&rows=20
Likewise, keep advancing start in multiples of 20 with rows=20 until the returned data is an empty list [].
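The "advance start until data is an empty list" rule can be sketched as a small loop. This is a minimal sketch under stated assumptions: `fetch_page` is a hypothetical callable that performs the request for one page and returns the `data` list, `max_pages` is an added safety cap that the original notes do not mention, and `make_fake_endpoint` only exists so the loop can be tried without the network.

```python
def fetch_all(fetch_page, rows=20, max_pages=500):
    """Collect items page by page until a page comes back empty."""
    items = []
    for page in range(max_pages):
        data = fetch_page(page * rows, rows)
        if not data:  # empty list [] means there is nothing left
            break
        items.extend(data)
    return items


def make_fake_endpoint(total):
    """Stand-in for a real endpoint: serves `total` records in slices."""
    records = [{"id": i} for i in range(total)]

    def fetch_page(start, rows):
        return records[start:start + rows]

    return fetch_page
```

Injecting `fetch_page` keeps the paging rule separate from the request details, so the same loop can serve every list endpoint that follows this start/rows convention.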
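In addition, to collect more users, separately crawl:
bottom tab 排行榜 (Rankings) -> 学霸 (Top Students):
https://childapi.xxx.com/top/sign_top?sign=17c411d4c4d011c614b7cef633fc68ce&stamp=1536909875&uid=0&auth_token=0&area_id=0&start=40&rows=20
Keep advancing start in multiples of 20 with rows=20 until the returned data is an empty list []. In testing, this list never exceeded 200 entries.
bottom tab 排行榜 (Rankings) -> 人气 (Popularity):
https://childapi.xxx.com/top/shownews_top_redis?sign=f00a8402d9ea4f26a0eee153aba71fb3&stamp=1536910053&uid=0&time_type=1&area_id=0&ranking_type=0&start=40&auth_token=0&rows=20
Keep advancing start in multiples of 20 with rows=20 until the returned data is an empty list []. In testing, this list never exceeded 200 entries either.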
For each user:
fetch the details:
https://childapi.xxx.com/member?sign=a5d2f9bd3f98214ab3cf8de47e14833c&stamp=1536917132&uid=0&auth_token=0&member_id=4864238
then fetch the user's list of works:
https://childapi.xxx.com/member/show_list?sign=9fbde32314aff6b77be9b64b2464f78a&timestamp=1536917132&uid=0&auth_token=0&start=0&member_id=4864238&rows=20
Each of a user's works is a show, so also fetch the details of each show:
https://childapi.xxx.com/show/detail?show_id=60377593
Also, to discover more users, fetch each user's follows and fans:
https://childapi.xxx.com/member/follows?auth_token=0&member_id=32527428&rows=20&start=0&timestamp=1536919513254&uid=0&sign=e56d81d7373be0b53ecc8fe16cb2df2c
and:
https://childapi.xxx.com/member/fans?auth_token=0&member_id=13494467&rows=20&start=0&timestamp=1536919109752&uid=0&sign=4be774e38c0619529b1c7556011fd9ba
Finally, to widen coverage and collect as many courses, users, and shows as possible, follow the ids obtained along the way to fetch the related content wherever possible.
For example:
a show's detail provides the ids of its user and its course.
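One way to organize this "follow every id you find" expansion is a breadth-first traversal with a visited set, so each course, user, or show is processed once. This is a minimal sketch, not the crawler actually used: `crawl` and `expand` are hypothetical names, nodes are modeled as `(kind, id)` tuples, and `limit` is an added safety cap.

```python
from collections import deque


def crawl(seeds, expand, limit=100000):
    """Visit nodes breadth-first, following ids returned by `expand`.

    `expand(node)` is expected to fetch the node's detail and return
    the related (kind, id) tuples found in it.
    """
    seen = set()
    queue = deque()
    for node in seeds:
        if node not in seen:
            seen.add(node)
            queue.append(node)

    visited = []
    while queue and len(visited) < limit:
        node = queue.popleft()
        visited.append(node)
        for neighbor in expand(node):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return visited
```

In practice `expand` would dispatch on the kind: a show yields its user and course, a user yields shows, follows, and fans, and a course yields the users from its rankings.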
Layout of the saved data:
- course
  - 61288
    - course_61288_video.mp4
    - course_61288_audio.mp3
    - course_61288_subtitle.srt
    - course_61288_info.json
      - stores the course's information
    - course_61288_shows.json
      - records how many shows belong to the course
      - updated whenever a show's details are fetched later
  - 48050
    - …
- user
  - 4864238
    - show
      - 144138897
        - show_144138897_video.mp4
        - show_144138897_info.json
          - stores the show's details
      - 135839205
        - …
    - user_4864238_info.json
      - stores the user's details
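The naming scheme in the layout above can be captured in a few path helpers. This is a minimal sketch: the function names and the `root` output directory are hypothetical, and nesting shows under their user's folder follows the reading of the layout given here.

```python
from pathlib import Path

COURSE_SUFFIXES = {
    "video": "video.mp4",
    "audio": "audio.mp3",
    "subtitle": "subtitle.srt",
    "info": "info.json",
    "shows": "shows.json",
}
SHOW_SUFFIXES = {"video": "video.mp4", "info": "info.json"}


def course_path(root, course_id, kind):
    """e.g. <root>/course/61288/course_61288_info.json"""
    name = f"course_{course_id}_{COURSE_SUFFIXES[kind]}"
    return Path(root) / "course" / str(course_id) / name


def user_info_path(root, user_id):
    """e.g. <root>/user/4864238/user_4864238_info.json"""
    return Path(root) / "user" / str(user_id) / f"user_{user_id}_info.json"


def show_path(root, user_id, show_id, kind):
    """e.g. <root>/user/4864238/show/144138897/show_144138897_video.mp4"""
    name = f"show_{show_id}_{SHOW_SUFFIXES[kind]}"
    return Path(root) / "user" / str(user_id) / "show" / str(show_id) / name
```

Keeping the suffix tables in one place means an unknown `kind` fails fast with a `KeyError` instead of silently writing a misnamed file.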
Notes:
Examples
(1) course_61288_info.json
{
"id": "61288",
"title": "我们有很多共同点",
"description": "麦洛找了很多它和小猫的共同点,试图让小猫不那么害羞,可是小猫并不吃这一套。",
"video": "
",
"video_srt": "
",
"audio": "
",
"pic": "
https://img.xxx.cn/2018-09-10/5b9634a3d742d.jpg
",
"if_subtitle": "0",
"dif_level": "3",
"subtitle_en": "
https://cdn2.xxx.cn/2018-09-10/15365514894833.srt
",
"subtitle_num": "6",
"category_id": "10",
"shows": "353",
"views": "2270",
"editor": "xxxMargaret",
"editor_uid": "0",
"tag": "小兔子麦洛,野猫",
"status": "1",
"create_time": "2016-01-26 11:38",
"isalbum": "0",
"top": "0",
"ifshow": "1",
"update_time": "1536658200",
"sort": "-249308",
"check": "1",
"copyright": "1",
"is_vip": "0",
"show_peoples": "350",
"score_peoples": "38",
"is_score": "2",
"is_needbuy": "0",
"duration": "28",
"bookOriginalId": "0",
"permit_client": "2,4,1,3",
"assign_times": "0",
"copy": "视频片段摘自:“Milo and the wild cat _ Cartoon for kids”,本视频仅供免费学习使用!如需观看完整版,请支持正版!",
"redirect": {
"title": "",
"url": "",
"sort": "0"
},
"editors": [
{
"title": "上传",
"nickname": "xxxMargaret",
"uid": 0
},
{
"title": "听译",
"nickname": "妞妞的蛋卷",
"uid": 0
},
{
"title": "审校",
"nickname": "酥酥Jesus",
"uid": 0
},
{
"title": "制作",
"nickname": "徐婷",
"uid": 0
}
],
"share_talk": "",
"share_pic": "
https://img.xxx.cn/2018-09-10/5b9634a3d742d.jpg
",
"share_url": "
https://child.xxx.cn/index.php?m=home&c=Activity&a=childshare_video&course=MDAwMDAwMDAwMLGdpquCe8yh
",
"score_type": 4,
"score_weight": [
{
"low": "0",
"height": "55",
"weight": "1.20"
},
{
"low": "56",
"height": "70",
"weight": "1.15"
},
{
"low": "71",
"height": "80",
"weight": "1.10"
},
{
"low": "81",
"height": "90",
"weight": "1.05"
},
{
"low": "91",
"height": "100",
"weight": "1.00"
}
],
"album_id": null,
"is_strate": "0",
"if_strate_buy": "0",
"strate_audio_id": "0",
"category": "今日更新",
"nature": "",
"album_title": "",
"share_title": "我发现了一个超有意思的今日更新片段",
"share_desc": "《我们有很多共同点》,快来围观吧",
"share_friend": "我发现了一个超有意思的今日更新片段《我们有很多共同点》",
"skip_url": "",
"strate_url": "",
"strate_isbuy": 0,
"strate_pic": "
https://img.xxx.cn/strate_detail.png
",
"album_isbuy": 0,
"feedback_url": "
https://child.xxx.cn/home/basic/course_feedback?uid=0&course_id=61288
",
"video_adver": [],
"course_adver": [
{
"id": "6427",
"title": "想和老外“无障碍交流”?来这里",
"pic": "
https://img.xxx.cn/2018-09-11/5b97cdc9be2b8.jpg
",
"type": "custom",
"son_type": "",
"show_id": "0",
"is_share": "0",
"content": "",
"scheme_url": "",
"sub_title": "",
"show_type": "0",
"sort": "0",
"shows": "10985",
"views": "392",
"weight": "10",
"score_type": "2",
"score": "2",
"share_pic": "
https://img.xxx.cn/2018-09-11/5b97cdc9be2b8.jpg
",
"html": "",
"clickreport": [],
"displayreport": [],
"url": "
http://shaoer.xxx.com/basic/slider?adv=MDAwMDAwMDAwMLGdsquBrqKh
"
}
]
}
(2) user_4864238_info.json
{
"id": "4864238",
"uc_id": "8638661",
"nickname": "夏竹",
"avatar": "
https://img.xxx.cn/2018-06-30/5b3734cb58bee.jpg
",
"mobile": "",
"app_type": "2",
"version": "3.9.0",
"push_info": "{\"comments\":\"1\"}",
"signature": "其实皮一下很开心",
"birthday": "2017-09-24",
"sex": "2",
"school": "1435652",
"area": "4403",
"type": "2",
"status": "1",
"reg_time": "1465926412",
"fans": "973",
"uid": "4864238",
"follows": "1516",
"views": "228",
"photos": "71",
"guestbooks": "0",
"shows": "118",
"words": "0",
"collects": "4",
"school_str": "h华x新小学",
"is_black": "0",
"support_collect": "4",
"cover": "
https://img.xxx.cn/cover_default.jpg
",
"medal": [],
"is_crown": "0",
"search_id": "4927133",
"ugcactive": {
"icon": "",
"title": "字幕组",
"sub_title": "新活动",
"url": "
http://ugctest.xxx.cn/app/index/index?
"
},
"is_following": "0",
"is_follow": "0",
"follow_nickname": "",
"user_number": "8862171",
"score_time": "",
"vip_endtime": "0",
"is_vip": "0",
"libu_level": "0",
"claims_url": "
https://child.xxx.cn/home/basic/report?type=3&tyid=0&uid=MDAwMDAwMDAwMLB0nm8&member_id=MDAwMDAwMDAwMLF3yGSBe67esaR0cg
",
"dav": "",
"dv_type": "0",
"dv_status": "0",
"libu_vip_endtime": "0",
"libu_vip": "0",
"libu_first": "0"
}
(3) show_144138897_info.json
{
"id": "144138897",
"uid": "4864238",
"course_id": "56931",
"album_id": "3353",
"video": "
",
"create_time": "2018-07-12 16:46",
"score": "91",
"comments": "0",
"supports": "2",
"views": "0",
"diamonds": "0",
"nickname": "夏竹",
"avatar": "",
"school": "1435652",
"area": "4403",
"birthday": "2017-09-24",
"course_title": "你是谁呢",
"pic": "
https://img.xxx.cn/2018-07-03/5b3b28404a45b.jpg
",
"permit_client": "2,4,1,3",
"permit_show": "1",
"is_crown": "0",
"school_str": "",
"dav": "",
"dv_type": "0",
"is_vip": "0"
}
(4) course_61288_shows.json
{
"total": 1,
"shows": [
{
"id": "155469749",
"uid": "2248187",
"nickname": "Aegla",
"course_id": "15159",
"course_title": "不要低估他",
"video": "
",
}
]
}
When reposting, please credit the source: 在路上 » 【记录】整理爬取趣配音app的数据的逻辑