Commit daa0d81c authored Aug 01, 2018 by yangchen
Modify Tieba crawling
parent 4f5dfa32
Showing 1 changed file with 28 additions and 18 deletions
src/main/java/com/zhiwei/media_data_crawler/crawler/BaiduTiebaCrawlerParse.java
...
...
@@ -117,29 +117,12 @@ public class BaiduTiebaCrawlerParse extends HttpClientTemplateOK {
String content = null;
String author = null;
String tid = null;
System.out.println("-------------------------");
// System.out.println(element.toString());
author = element.select("li.d_name").select("a").text();
content = element.select("div.p_content_nameplate").select("cc").select("div.clearfix").text();
if (content == null || content.length() < 1) {
    content = element.select("div.j_d_post_content").text();
}
if (time == null || time.length() < 1) {
    time = element.select("span.tail-info").text();
    if (time.contains("楼")) {
        time = time.split("楼")[1].trim();
    }
}
if (time == null || time.trim().length() < 1) {
    Pattern pa = Pattern.compile("date":"(.*?)"");
    Matcher ma = pa.matcher(element.toString());
    while (ma.find()) {
        time = ma.group(0);
        time = time.split("date":"")[1].split(""")[0];
        break;
    }
}
time = getTime(element);
Pattern pa2 = Pattern.compile("post_id"(.*?),"");
Matcher ma2 = pa2.matcher(element.toString());
while (ma2.find()) {
...
...
@@ -163,6 +146,33 @@ public class BaiduTiebaCrawlerParse extends HttpClientTemplateOK {
}
/**
*
* @Description Get the post time from Baidu Tieba
* @param element
* @return
*/
private static String getTime(Element element) {
    String time = null;
    if (time == null || time.length() < 1) {
        time = element.select("span.tail-info").text();
        if (time.contains("楼")) {
            time = time.split("楼")[1].trim();
        }
    }
    if (time == null || time.trim().length() < 1) {
        Pattern pa = Pattern.compile("date":"(.*?)"");
        Matcher ma = pa.matcher(element.toString());
        while (ma.find()) {
            time = ma.group(0);
            time = time.split("date":"")[1].split(""")[0];
            break;
        }
    }
    return time;
}
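Two details of the extracted `getTime` stand out. Because `time` starts as `null`, its first guard is always taken. And `split("楼")[1]` throws `ArrayIndexOutOfBoundsException` when nothing follows the floor marker, since Java's `split` drops trailing empty strings. The following is only a sketch of a tighter variant, not the code in this commit. It takes plain strings instead of a jsoup `Element` so that it runs on the standard library alone; the class and method names are hypothetical. The quote characters in the date pattern are not legible in this diff, so the sketch assumes they appear either raw or as `&quot;` in the serialized element.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the commit: mirrors the two-step lookup in getTime.
final class TiebaTimeSketch {

    // Assumption: a quote in the serialized element is either a raw '"' or the entity &quot;.
    private static final String Q = "(?:\"|&quot;)";
    private static final Pattern DATE_FIELD =
            Pattern.compile(Q + "date" + Q + ":" + Q + "(.*?)" + Q);

    // tailInfoText stands in for element.select("span.tail-info").text(),
    // elementHtml for element.toString().
    static String extractTime(String tailInfoText, String elementHtml) {
        String time = afterFloorMarker(tailInfoText);
        if (!time.isEmpty()) {
            return time;
        }
        // Fallback: read the date field with a capture group
        // instead of group(0) followed by two splits.
        Matcher m = DATE_FIELD.matcher(elementHtml == null ? "" : elementHtml);
        return m.find() ? m.group(1).trim() : "";
    }

    // Returns the text after the first floor marker (U+697C, the character
    // used in the diff), or the whole trimmed text when no marker is present.
    static String afterFloorMarker(String text) {
        if (text == null) {
            return "";
        }
        int idx = text.indexOf('\u697C');
        return (idx >= 0 ? text.substring(idx + 1) : text).trim();
    }
}
```

One behavioural difference from the committed code: when the text contains several floor markers, this sketch keeps everything after the first one, whereas `split(...)[1]` keeps only the segment between the first and second. It also returns an empty string rather than `null` when no time is found.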
/**
* @Title: downloadHtml
* @author hero
* @Description: Download the data of a specific Baidu Tieba page
...
...