zhiwei / soubao_crawler
Commit 1a02bc2f authored Sep 20, 2018 by [zhangzhiwei]
Add the cleaning middleware and remove the old cleaning and storage middleware
parent 336b4931
Showing 16 changed files with 205 additions and 255 deletions (+205 -255)

pom.xml  +32 -52
src/main/java/com/zhiwei/crawler/dao/MediaDataDAO.java  +0 -44
src/main/java/com/zhiwei/crawler/dao/ProducerKafka.java  +60 -0
src/main/java/com/zhiwei/crawler/dao/WordsDao.java  +3 -3
src/main/java/com/zhiwei/crawler/dbtemplate/MongoDBTemplate.java  +4 -5
src/main/java/com/zhiwei/crawler/dbtemplate/RsidClientTemplate.java  +55 -9
src/main/java/com/zhiwei/crawler/run/MainRun.java  +7 -0
src/main/java/com/zhiwei/crawler/run/SoubaoCrawlerRun.java  +3 -1
src/main/java/com/zhiwei/crawler/soubao/SouBaoCrawlerThread.java  +3 -3
src/main/java/com/zhiwei/crawler/util/ProxyClientUtil.java  +0 -33
src/main/java/com/zhiwei/crawler/util/ReadFileTools.java  +0 -43
src/main/java/com/zhiwei/crawler/util/TreatData.java  +13 -7
src/main/resources/config.json  +0 -15
src/main/resources/configTest.json  +0 -15
src/main/resources/db.properties  +20 -20
src/main/resources/proxyip.properties  +5 -5

pom.xml @ 1a02bc2f
 <project xmlns="http://maven.apache.org/POM/4.0.0"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
     <modelVersion>4.0.0</modelVersion>
     <groupId>com.zhiwei.crawler</groupId>
     <artifactId>soubao-crawlerNew</artifactId>
     <version>0.0.2-SNAPSHOT</version>
     <name>搜报网采集</name>
     <properties>
         <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
         <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
     </properties>
@@ -18,69 +17,49 @@
                 <email>zhangzhiwei@zhiweidata.com</email>
             </developer>
         </developers>
     <dependencies>
-        <!-- https://mvnrepository.com/artifact/com.squareup.okhttp3/okhttp -->
-        <dependency>
-            <groupId>com.squareup.okhttp3</groupId>
-            <artifactId>okhttp</artifactId>
-            <version>3.10.0</version>
-        </dependency>
-        <!-- https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-core -->
-        <dependency>
-            <groupId>org.apache.logging.log4j</groupId>
-            <artifactId>log4j-core</artifactId>
-            <version>2.11.0</version>
-        </dependency>
-        <!-- https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-api -->
         <dependency>
-            <groupId>org.apache.logging.log4j</groupId>
-            <artifactId>log4j-api</artifactId>
-            <version>2.11.0</version>
-        </dependency>
-        <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
-        <dependency>
-            <groupId>org.jsoup</groupId>
-            <artifactId>jsoup</artifactId>
-            <version>1.11.3</version>
+            <groupId>org.mongodb</groupId>
+            <artifactId>mongo-java-driver</artifactId>
+            <version>3.8.1</version>
         </dependency>
         <dependency>
-            <groupId>org.mongodb</groupId>
-            <artifactId>mongo-java-driver</artifactId>
-            <version>3.6.3</version>
+            <groupId>com.zhiwei.middleware</groupId>
+            <artifactId>proxy-client</artifactId>
+            <version>0.0.2-RELEASE</version>
         </dependency>
         <dependency>
-            <groupId>com.zhiwei</groupId>
-            <artifactId>mongosave-service-client</artifactId>
-            <version>1.0-SNAPSHOT</version>
+            <groupId>com.zhiwei.tools</groupId>
+            <artifactId>zhiwei-tools</artifactId>
+            <version>0.0.5-SNAPSHOT</version>
         </dependency>
         <dependency>
             <groupId>com.zhiwei.middleware</groupId>
-            <artifactId>rsid-client</artifactId>
-            <version>0.0.2-SNAPSHOT</version>
+            <artifactId>cleaner-unified-urlfilter</artifactId>
+            <version>1.0-SNAPSHOT</version>
         </dependency>
         <dependency>
-            <groupId>com.zhiwei.middleware</groupId>
-            <artifactId>proxy-client</artifactId>
-            <version>0.0.1-RELEASE</version>
+            <groupId>org.apache.kafka</groupId>
+            <artifactId>kafka-clients</artifactId>
+            <version>2.0.0</version>
         </dependency>
         <dependency>
-            <groupId>com.zhiwei</groupId>
-            <artifactId>zhiweiTools</artifactId>
-            <version>0.0.6-SNAPSHOT</version>
+            <groupId>org.testng</groupId>
+            <artifactId>testng</artifactId>
+            <version>6.14.3</version>
+            <scope>test</scope>
         </dependency>
     </dependencies>
     <build>
         <plugins>
             <plugin>
@@ -145,7 +124,7 @@
             </plugin>
         </plugins>
     </build>
-</project>
+</project>
\ No newline at end of file

src/main/java/com/zhiwei/crawler/dao/MediaDataDAO.java (deleted, 100644 → 0) @ 336b4931
-package com.zhiwei.crawler.dao;
-
-import com.alibaba.fastjson.JSON;
-import com.alibaba.fastjson.JSONObject;
-import com.mongodb.DBObject;
-import com.zhiwei.core.MongoSaveService;
-import com.zhiwei.crawler.config.Config;
-import com.zhiwei.crawler.util.ReadFileTools;
-
-/**
- * @ClassName: MediaDataDAO
- * @Description: TODO(data storage)
- * @author hero
- * @date Aug 11, 2017, 10:49:49 AM
- */
-public class MediaDataDAO {
-
-    /** Production environment **/
-    private static JSONObject json = JSON.parseObject(ReadFileTools.getFileContent("config.json", "utf-8"));
-    /** Test environment **/
-//  private static JSONObject json = JSON.parseObject(ReadFileTools.getFileContent("configTest.json", "utf-8"));
-
-    private static final MongoSaveService service = MongoSaveService.getMongoSaveService(json, Config.mongoIp, Config.savedbName, Config.saveCollName, Config.redisKey);
-
-    /**
-     * @Title: addMediaData
-     * @author hero
-     * @Description: TODO(store a single record)
-     * @param @param doc settings file
-     * @return void return type
-     */
-    public static void addMediaData(DBObject doc){
-        service.setRsidName("rsid");
-        try {
-            service.insert(doc);
-        } catch (Exception e) {
-            e.printStackTrace();
-        }
-    }
-}

src/main/java/com/zhiwei/crawler/dao/ProducerKafka.java (new file, 0 → 100644) @ 1a02bc2f
+package com.zhiwei.crawler.dao;
+
+import java.util.Date;
+import java.util.Properties;
+import java.util.concurrent.Future;
+
+import org.apache.kafka.clients.producer.KafkaProducer;
+import org.apache.kafka.clients.producer.Producer;
+import org.apache.kafka.clients.producer.ProducerRecord;
+import org.apache.kafka.clients.producer.RecordMetadata;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+
+import com.alibaba.fastjson.JSONObject;
+import com.mongodb.DBObject;
+import com.zhiwei.crawler.dbtemplate.RsidClientTemplate;
+
+public class ProducerKafka {
+
+    private static final Logger logger = LogManager.getLogger(ProducerKafka.class);
+    private static final String topic = "crawler-test";
+    private static Producer<String, String> producer;
+
+    static {
+        if (producer == null){
+            Properties props = new Properties();
+            props.put("bootstrap.servers", "kafka1.irybd.com:9092");
+            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
+            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
+            props.put("enable.auto.commit", "true");
+            props.put("compression.type", "snappy");
+            // props.put("acks", "all");
+            // props.put("retries", 0);
+            // props.put("batch.size", 16384);
+            // props.put("linger.ms", 1);
+            producer = new KafkaProducer<String, String>(props);
+        }
+    }
+
+    public static void add(String spiderName, DBObject doc) {
+        String data = JSONObject.toJSONString(doc);
+        Future<RecordMetadata> future = producer.send(new ProducerRecord<String, String>(topic, spiderName, data));
+        try {
+            RecordMetadata recMeta = future.get();
+            logger.info("添加成功。。。。。。,信息为:::{}", recMeta);
+        } catch (Exception e) {
+            logger.info("添加失败,失败原因为:::{}", e);
+        }
+    }
+
+    public static void main(String[] args) {
+        boolean f = RsidClientTemplate.addFilterUrl("http://www.sougou.com/15.html", "测试标题222", "百度", new Date(), "网媒");
+        System.out.println("f======" + f);
+    }
+}
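
Review note on ProducerKafka.java: the static initializer hard-codes the broker address and sets `enable.auto.commit`, which is a Kafka consumer setting that a producer does not use. Below is a minimal sketch, using only `java.util.Properties`, of how the producer settings could be assembled from defaults plus caller-supplied overrides. The class name `ProducerProps` and the method `build` are hypothetical and not part of this commit; the property keys are the same ones that appear in the diff.

```java
import java.util.Properties;

public class ProducerProps {

    // Serializer class name exactly as used in the commit.
    public static final String STRING_SERIALIZER =
            "org.apache.kafka.common.serialization.StringSerializer";

    /**
     * Builds producer settings. The broker list is passed in rather than
     * hard-coded, and any entry in overrides replaces the default.
     */
    public static Properties build(String bootstrapServers, Properties overrides) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", STRING_SERIALIZER);
        props.put("value.serializer", STRING_SERIALIZER);
        props.put("compression.type", "snappy");
        // "enable.auto.commit" is intentionally left out: it is a consumer option.
        if (overrides != null) {
            props.putAll(overrides);
        }
        return props;
    }
}
```

The resulting `Properties` object could be handed to `new KafkaProducer<String, String>(props)` as in the commit; whether the broker address should come from `db.properties` or elsewhere is left open here.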

src/main/java/com/zhiwei/crawler/dao/WordsDao.java @ 1a02bc2f
@@ -4,8 +4,8 @@ import java.util.List;
 import java.util.concurrent.BlockingQueue;
 import java.util.concurrent.LinkedBlockingQueue;
 
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
 
 import com.mongodb.BasicDBObject;
 import com.mongodb.DBCursor;
@@ -14,7 +14,7 @@ import com.zhiwei.crawler.config.Config;
 import com.zhiwei.crawler.dbtemplate.MongoDBTemplate;
 
 public class WordsDao extends MongoDBTemplate {
-    private static Logger logger = LoggerFactory.getLogger(WordsDao.class);
+    private static Logger logger = LogManager.getLogger(WordsDao.class);
 
     public WordsDao(){
         super.setDbName(Config.crawlerdbName);

src/main/java/com/zhiwei/crawler/dbtemplate/MongoDBTemplate.java @ 1a02bc2f
 package com.zhiwei.crawler.dbtemplate;
 
-import java.util.Arrays;
-
 import com.mongodb.DB;
 import com.mongodb.DBCollection;
@@ -29,13 +28,13 @@ public class MongoDBTemplate
         try {
             if (reader == null)
             {
-                reader = new MongoClient(address, Arrays.asList(credential));
-//              reader = new MongoClient(address);
+//              reader = new MongoClient(address, Arrays.asList(credential));
+                reader = new MongoClient(address);
             }
             if (writer == null)
             {
-                writer = new MongoClient(address, Arrays.asList(credential));
-//              writer = new MongoClient(address);
+//              writer = new MongoClient(address, Arrays.asList(credential));
+                writer = new MongoClient(address);
             }
         } catch (MongoException e) {
             e.printStackTrace();

src/main/java/com/zhiwei/crawler/dbtemplate/RsidClientTemplate.java @ 1a02bc2f
 package com.zhiwei.crawler.dbtemplate;
 
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
 
 import com.zhiwei.crawler.config.Config;
-import com.zhiwei.rsid.core.RsidClient;
-import com.zhiwei.zhiweiTools.tools.ZhiWeiTools;
+import com.zhiwei.middleware.cleaner.ptenum.PTENUM;
+import com.zhiwei.middleware.cleaner.urlfilter.UnifiedUrlFilterClient;
+import com.zhiwei.middleware.filter.config.Definition;
+import com.zhiwei.tools.tools.ZhiWeiTools;
 
 /**
  * @ClassName: RsidClient
@@ -16,8 +22,24 @@ import com.zhiwei.zhiweiTools.tools.ZhiWeiTools;
  */
 public class RsidClientTemplate {
 
-    private static Logger logger = LoggerFactory.getLogger(RsidClientTemplate.class);
-    private static RsidClient client = RsidClient.build(Config.rsidUrl, Config.rsidGroup);
+    private static Logger logger = LogManager.getLogger(RsidClientTemplate.class);
+    private static UnifiedUrlFilterClient client;
+
+    static {
+        if (client == null){
+            synchronized (RsidClientTemplate.class) {
+                if (client == null) {
+                    try {
+                        client = UnifiedUrlFilterClient.getClient(Config.rsidUrl, Config.rsidGroup, Definition.GroupType.PROVIDER);
+                    } catch (Exception e) {
+                        logger.error("链接清洗中间件时出现错误,错误为:::{}", e);
+                    }
+                }
+            }
+        }
+    }
 
     /**
      * @Title: isMeidaExit
@@ -27,12 +49,36 @@ public class RsidClientTemplate {
      * @param @return settings file
      * @return boolean return type
      */
-    public static boolean addFilterUrl(String id){
+    public static boolean addFilterUrl(String url, String title, String source, Date date, String pt){
         for (int i = 0; i < 3; i++){
             try {
-                return client.addFilterUrl(id, false, Config.redisKey);
+                Map<String, Object> filterMap = new HashMap<String, Object>();
+                filterMap.put("url", url);
+                filterMap.put("title", title);
+                filterMap.put("source", source);
+                filterMap.put("time", String.valueOf(date.getTime()));
+                filterMap.put("pt", pt);
+                if (client == null){
+                    System.out.println("client is null");
+                }
+                return client.contains(filterMap, PTENUM.COMMON);
+            } catch (Exception e) {
+                logger.error("判断此条网媒消息是否存在出现问题", e);
+                ZhiWeiTools.sleep(500);
+                continue;
+            }
+        }
+        return false;
+    }
+
+    public static boolean addFilterUrl(String url){
+        for (int i = 0; i < 3; i++){
+            try {
+                return client.contains(url, PTENUM.COMMON);
             } catch (Exception e) {
-                logger.error("判断此条网媒消息是否存在出现问题", e.fillInStackTrace());
+                logger.error("判断此条网媒消息是否存在出现问题", e);
                 ZhiWeiTools.sleep(500);
                 continue;
             }
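
Review note on RsidClientTemplate.java: both `addFilterUrl` overloads repeat the same pattern, up to three attempts with a 500 ms pause after each failure and `false` as the final result. The sketch below shows one way that loop could be factored into a single helper using only the JDK. The names `Retry` and `call` are hypothetical, and `Thread.sleep` stands in for `ZhiWeiTools.sleep`, whose behavior is not shown in this commit.

```java
import java.util.concurrent.Callable;

public class Retry {

    /**
     * Runs task up to maxAttempts times, pausing pauseMillis between failed
     * attempts, and returns fallback if every attempt throws.
     */
    public static <T> T call(Callable<T> task, int maxAttempts, long pauseMillis, T fallback) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                if (e instanceof InterruptedException) {
                    // Preserve the interrupt flag and stop retrying.
                    Thread.currentThread().interrupt();
                    return fallback;
                }
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(pauseMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return fallback;
                    }
                }
            }
        }
        return fallback;
    }
}
```

With such a helper, the one-argument overload might reduce to something like `Retry.call(() -> client.contains(url, PTENUM.COMMON), 3, 500, false)`. The choice of fallback deserves attention: in this commit TreatData treats a `false` result as "not seen before" and forwards the record to Kafka, so a client that failed to initialize would cause every record to be forwarded after three failed attempts.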

src/main/java/com/zhiwei/crawler/run/MainRun.java @ 1a02bc2f
@@ -4,6 +4,10 @@ import java.util.concurrent.Executors;
 import java.util.concurrent.ScheduledExecutorService;
 import java.util.concurrent.TimeUnit;
 
+import com.zhiwei.crawler.config.ProxyConfig;
+import com.zhiwei.crawler.proxy.ProxyFactory;
+import com.zhiwei.proxy.common.Definition.GroupType;
+
 public class MainRun {
 
     private ScheduledExecutorService scheduExec;
@@ -16,6 +20,9 @@ public class MainRun {
     }
 
     public static void main(String[] args) {
+        /** Initialize the proxy IPs **/
+        ProxyFactory.init(ProxyConfig.registry, ProxyConfig.group, GroupType.PROVIDER, ProxyFactory::getNatProxy);
+
         new MainRun().showTimer();
     }

src/main/java/com/zhiwei/crawler/run/SoubaoCrawlerRun.java @ 1a02bc2f
@@ -9,7 +9,7 @@ import org.apache.logging.log4j.Logger;
 
 import com.zhiwei.crawler.dao.WordsDao;
 import com.zhiwei.crawler.soubao.SouBaoCrawlerThread;
-import com.zhiwei.zhiweiTools.tools.ZhiWeiTools;
+import com.zhiwei.tools.tools.ZhiWeiTools;
 
 /**
  * @ClassName: BaiDuNewsCrawler
@@ -39,9 +39,11 @@ public class SoubaoCrawlerRun implements Runnable{
         int thread = 5;
         // Mead Johnson (美赞臣) data takes priority
         BlockingQueue<String> wordesQueue = wordsDao.getWordList("美赞臣");
+        System.out.println("美赞臣组关键词大小:::" + wordesQueue.size());
         // crawl keywords for the other groups
         BlockingQueue<String> otherWordQueue = wordsDao.getWordList("-美赞臣");
         wordesQueue.addAll(otherWordQueue);
+        System.out.println("总的关键词大小:::" + wordesQueue.size());
         logger.info("关键词总量为:::{}", wordesQueue.size());
         SouBaoCrawlerThread[] souBaoCrawlerThread = new SouBaoCrawlerThread[thread];
         ExecutorService service = Executors.newFixedThreadPool(2);

src/main/java/com/zhiwei/crawler/soubao/SouBaoCrawlerThread.java @ 1a02bc2f
@@ -6,9 +6,9 @@ import java.util.concurrent.BlockingQueue;
 import org.apache.logging.log4j.LogManager;
 import org.apache.logging.log4j.Logger;
 
+import com.zhiwei.crawler.proxy.ProxyFactory;
 import com.zhiwei.crawler.run.SoubaoCrawlerRun;
-import com.zhiwei.crawler.util.ProxyClientUtil;
-import com.zhiwei.zhiweiTools.tools.ZhiWeiTools;
+import com.zhiwei.tools.tools.ZhiWeiTools;
 
 /**
  * @ClassName: BaiDuNewsCrawler
@@ -30,7 +30,7 @@ public class SouBaoCrawlerThread extends Thread{
         while (wordsQueue != null && wordsQueue.size()>0){
             try {
                 String word = wordsQueue.take();
-                Proxy proxy = ProxyClientUtil.getNATProxy();
+                Proxy proxy = ProxyFactory.proxyCallback().getProxy();
                 /*** Start crawling **/
                 logger.info("开始采集:{}搜报网关键词,目前未采集的关键词为:{}", word, wordsQueue.size());
                 long s = System.currentTimeMillis();

src/main/java/com/zhiwei/crawler/util/ProxyClientUtil.java (deleted, 100644 → 0) @ 336b4931
-package com.zhiwei.crawler.util;
-
-import java.net.Proxy;
-
-import com.zhiwei.crawler.config.ProxyConfig;
-import com.zhiwei.proxy.common.Definition.GroupType;
-import com.zhiwei.proxy.core.ProxyClient;
-import com.zhiwei.proxy.core.ProxyClientFactory;
-
-public class ProxyClientUtil {
-
-    private static volatile ProxyClient client;
-
-    /**
-     * @Title: getNATProxy
-     * @author hero
-     * @Description: get a proxy IP from the NAT machine
-     * @param @return settings file
-     * @return Proxy return type
-     */
-    public static Proxy getNATProxy(){
-        if (client == null) {
-            synchronized (ProxyClientUtil.class) {
-                if (client == null) {
-                    client = ProxyClientFactory.build(ProxyConfig.registry, ProxyConfig.group, GroupType.PROVIDER);
-                }
-            }
-        }
-        return client.getNATProxy();
-    }
-}
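
Review note on ProxyClientUtil.java: the deleted `getNATProxy` created its client lazily with double-checked locking on a `volatile` field. A common JDK-only alternative is the initialization-on-demand holder idiom, which relies on the JVM guarantee that a class is initialized once, on first use, in a thread-safe way. The sketch below is illustrative only: `LazyClient`, `ExpensiveClient`, and `CREATED` are hypothetical names, with `ExpensiveClient` standing in for a client such as `ProxyClient`.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LazyClient {

    // Counts constructions so the single-initialization behavior can be observed.
    public static final AtomicInteger CREATED = new AtomicInteger();

    /** Hypothetical stand-in for an expensive, shared client object. */
    public static final class ExpensiveClient {
        ExpensiveClient() {
            CREATED.incrementAndGet();
        }
    }

    /** Not initialized until get() first touches Holder.INSTANCE. */
    private static final class Holder {
        static final ExpensiveClient INSTANCE = new ExpensiveClient();
    }

    public static ExpensiveClient get() {
        return Holder.INSTANCE;
    }
}
```

The same guarantee means that a null check and `synchronized` block placed inside a `static { ... }` initializer add no extra safety, because that initializer already runs once under the class initialization lock.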

src/main/java/com/zhiwei/crawler/util/ReadFileTools.java (deleted, 100644 → 0) @ 336b4931
-/**
- * Copyright © 2017 宁波知微瑞驰信息科技有限公司. All rights reserved.
- * @Title: Tools.java
- * @Package com.zhiwei.demo
- * @author 0xFF
- * @date Aug 2, 2017, 5:45:34 PM
- */
-package com.zhiwei.crawler.util;
-
-import java.io.BufferedInputStream;
-import java.io.ByteArrayOutputStream;
-
-/**
- * @ClassName: Tools
- * @Description: utility class
- * @author 0xFF
- * @date Aug 2, 2017, 5:45:34 PM
- */
-public class ReadFileTools {
-
-    /**
-     * Read the contents of a text file
-     * @param path
-     * @param encode
-     * @return String
-     */
-    public static String getFileContent(String path, String encode) {
-        String result = null;
-        ByteArrayOutputStream baos = null;
-        try (BufferedInputStream bis = new BufferedInputStream(ReadFileTools.class.getClassLoader().getResourceAsStream(path))) {
-            byte[] buf = new byte[1024];
-            baos = new ByteArrayOutputStream();
-            int count;
-            while ((count = bis.read(buf)) != -1) {
-                baos.write(buf, 0, count);
-            }
-            result = new String(baos.toByteArray(), encode);
-        } catch (Exception e) {}
-        return result;
-    }
-}

src/main/java/com/zhiwei/crawler/util/TreatData.java @ 1a02bc2f
@@ -8,9 +8,9 @@ import org.slf4j.LoggerFactory;
 
 import com.mongodb.BasicDBObject;
 import com.mongodb.DBObject;
-import com.zhiwei.crawler.dao.MediaDataDAO;
+import com.zhiwei.crawler.dao.ProducerKafka;
 import com.zhiwei.crawler.dbtemplate.RsidClientTemplate;
-import com.zhiwei.zhiweiTools.timeParse.TimeParse;
+import com.zhiwei.tools.timeparse.TimeParse;
 
 /**
  * @ClassName: TreatData
@@ -31,25 +31,31 @@ public class TreatData {
      * @return void return type
      */
     public static void treatDataAccount(Map<String, Object> dataMap) {
-        if (!(dataMap.get("_id") == null || dataMap.get("_id").equals("") || dataMap.get("source") == null
-                || dataMap.get("source").equals("")) || dataMap.get("time")!=null) {
+        if (!(dataMap.get("_id") == null || dataMap.get("_id").equals("")
+                || dataMap.get("source") == null
+                || dataMap.get("source").equals("")
+                || dataMap.get("time")==null
+                || dataMap.get("time").equals("")
+                || dataMap.get("title").equals("")|| dataMap.get("title")==null) ) {
             Date now = new Date(new Date().getTime() - 49 * 60 * 60 * 1000);// avoid dropping records whose time has the form 2018-06-22 00:00:00
             Date date = TimeParse.stringFormartDate(dataMap.get("time").toString());
             if (date.after(now)) {
                 logger.info("去重的链接为:{}", dataMap.get("_id"));
                 String url = dataMap.get("_id").toString();
-                if (RsidClientTemplate.addFilterUrl(url)){
+                String title = dataMap.get("title").toString();
+                String source = dataMap.get("source").toString();
+                if (!RsidClientTemplate.addFilterUrl(url, title, source, date, "平媒")){
                     DBObject doc = new BasicDBObject();
                     doc.put("_id", url);
                     doc.put("url", url);
                     doc.put("title", dataMap.get("title"));
-                    doc.put("time", date);
+                    doc.put("time", String.valueOf(date.getTime()));
                     doc.put("source", dataMap.get("source"));
                     doc.put("content", dataMap.get("content")+"".replaceAll("<[^>]*>", ""));
                     doc.put("pt", "平媒");
                     doc.put("type", "平媒");
                     doc.put("savetime", System.currentTimeMillis());
-                    MediaDataDAO.addMediaData(doc);
+                    ProducerKafka.add("搜报网", doc);
                 } else {
                     logger.info("搜报网地址为:{},此条数据重复", dataMap.get("_id"));
                 }
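
Review note on TreatData.java: two details in this hunk may be worth a second look. In the `content` line, the method call binds tighter than `+`, so `replaceAll` runs on the empty string literal and the HTML tags stay in the content. In the new guard, `dataMap.get("title").equals("")` is evaluated before the `title` null check, so a record without a title would throw a `NullPointerException` instead of being skipped. The sketch below shows one possible shape for both checks using only the JDK; `RecordChecks`, `isBlank`, `hasRequiredFields`, and `stripTags` are hypothetical names, not part of the commit.

```java
import java.util.HashMap;
import java.util.Map;

public class RecordChecks {

    /** True when the value is missing or its string form is empty. */
    public static boolean isBlank(Object value) {
        return value == null || value.toString().isEmpty();
    }

    /** Checks the same four fields the new guard in the diff looks at. */
    public static boolean hasRequiredFields(Map<String, Object> dataMap) {
        for (String key : new String[] {"_id", "source", "time", "title"}) {
            if (isBlank(dataMap.get(key))) {
                return false;
            }
        }
        return true;
    }

    /** Converts to a string first, then removes anything that looks like a tag. */
    public static String stripTags(Object content) {
        return String.valueOf(content).replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        Map<String, Object> dataMap = new HashMap<>();
        dataMap.put("content", "<p>hello</p>");

        // Expression as written in the diff: replaceAll applies to "" only.
        String asWritten = dataMap.get("content") + "".replaceAll("<[^>]*>", "");
        System.out.println(asWritten);                          // prints <p>hello</p>

        System.out.println(stripTags(dataMap.get("content")));  // prints hello
    }
}
```

Like the `+ ""` form in the diff, `String.valueOf` turns a missing `content` into the text `null`; whether that is the desired stored value is a separate question this sketch does not settle.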

src/main/resources/config.json (deleted, 100644 → 0) @ 336b4931
-{
-    "registry": "zookeeper://192.168.0.203:2181;zookeeper://192.168.0.104:2181;zookeeper://192.168.0.105:2181",
-    "group": "mongosaveservice",
-    "shardings": {
-        "enforce": false,
-        "settings": [
-            {
-                "collection": "video",
-                "method": "month",
-                "field": "time"
-            }
-        ]
-    }
-}

src/main/resources/configTest.json (deleted, 100644 → 0) @ 336b4931
-{
-    "registry": "zookeeper://192.168.0.234:2181",
-    "group": "mongosaveservice",
-    "shardings": {
-        "enforce": false,
-        "settings": [
-            {
-                "collection": "video",
-                "method": "month",
-                "field": "time"
-            }
-        ]
-    }
-}

src/main/resources/db.properties @ 1a02bc2f
 #####################Production environment#################################
-mongoIp=192.168.0.101
-mongoPort=30000
-db.username=zzwno
-db.paasword=zzwno1q2w3e4r
-db.certifiedDB=admin
-##save data dbInfo
-savedbName=mediaspider
-saveCollName=net_media
-##crawler word dbInfo
-crawlerdbName=qbjcPhoenix
-crawlerCollName=qbjc_crawlerword
-rsid.zookeeper.url = zookeeper://192.168.0.203:2181;zookeeper://192.168.0.104:2181;zookeeper://192.168.0.105:2181
-rsid.zookeeper.group=rsidservernew
-redisKey=media
-#####################Test environment#################################
-#mongoIp=192.168.0.233
-#mongoPort=27017
+#mongoIp=192.168.0.101
+#mongoPort=30000
 #db.username=zzwno
 #db.paasword=zzwno1q2w3e4r
 #db.certifiedDB=admin
@@ -25,6 +10,21 @@ redisKey=media
 ###crawler word dbInfo
 #crawlerdbName=qbjcPhoenix
 #crawlerCollName=qbjc_crawlerword
-#rsid.zookeeper.url = zookeeper://192.168.0.234:2181;
+#rsid.zookeeper.url = zookeeper://192.168.0.203:2181;zookeeper://192.168.0.104:2181;zookeeper://192.168.0.105:2181
 #rsid.zookeeper.group=rsidservernew
-#redisKey=media
\ No newline at end of file
+#redisKey=media
+#####################Test environment#################################
+mongoIp=127.0.0.1
+mongoPort=27017
+db.username=zzwno
+db.paasword=zzwno1q2w3e4r
+db.certifiedDB=admin
+##save data dbInfo
+savedbName=mediaspider
+saveCollName=net_media
+##crawler word dbInfo
+crawlerdbName=qbjcPhoenix
+crawlerCollName=qbjc_crawlerword
+rsid.zookeeper.url = zookeeper://192.168.0.36:2181;
+rsid.zookeeper.group=local
+redisKey=media
\ No newline at end of file

src/main/resources/proxyip.properties @ 1a02bc2f
-registry=zookeeper://192.168.0.203:2181;zookeeper://192.168.0.104:2181;zookeeper://192.168.0.105:2181
-group=hangzhou
+#registry=zookeeper://192.168.0.203:2181;zookeeper://192.168.0.104:2181;zookeeper://192.168.0.105:2181
+#group=hangzhou
 ########################################################
-#registry=zookeeper://202.107.192.94:2181
-#group=local
\ No newline at end of file
+registry=zookeeper://192.168.0.36:2181
+group=local
\ No newline at end of file