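Preface
I recently had a requirement where the project needs the national statistical division codes and urban-rural classification codes (全国统计用区划代码和城乡划分代码), covering all five levels: province, city, county/district, town/subdistrict, and village/neighborhood committee. I could not find a complete dataset through official channels, so I had to look into scraping it, and this is where Jsoup comes in. Once Jsoup has fetched the data, I want to store it in an Excel file, which easyExcel makes quick to implement.
However, the county and town data is very large. When scraping from a single IP, the site starts restricting access after a certain amount of data (a little over 8,000 records). The workaround is to use proxies, and in practice only paid proxies are reasonably stable. So although the code below can in principle fetch all five levels, I did not actually manage to fetch them in full. Read on for the implementation.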
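Before paying for proxies, it may be worth spacing requests out and retrying failed ones. The following is a minimal sketch of that idea only. It is not the proxy approach and not part of the original code: `PoliteFetch` and `fetchWithRetry` are hypothetical names, and the fetch is passed in as a `Supplier` that returns null on failure, which mirrors how the `connect` method later in this post behaves. Whether this is enough to stay under the site's limit is untested.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

class PoliteFetch {

    /**
     * Calls fetch up to maxAttempts times until it returns a non-null value.
     * Between attempts it sleeps baseDelayMs * 2^(attempt - 1) plus a random
     * jitter of up to baseDelayMs. Returns null if every attempt fails.
     */
    public static <T> T fetchWithRetry(Supplier<T> fetch, int maxAttempts, long baseDelayMs) {
        if (maxAttempts < 1 || baseDelayMs < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and baseDelayMs >= 0");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            T result = fetch.get();
            if (result != null) {
                return result;
            }
            if (attempt == maxAttempts) {
                break; // no point sleeping after the last attempt
            }
            long backoff = baseDelayMs << (attempt - 1);
            long jitter = ThreadLocalRandom.current().nextLong(baseDelayMs + 1);
            try {
                Thread.sleep(backoff + jitter);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated fetch: returns null (failure) twice, then succeeds
        String page = fetchWithRetry(() -> ++calls[0] < 3 ? null : "<html>ok</html>", 5, 10L);
        System.out.println(page + " after " + calls[0] + " attempts");
    }
}
```

Inside the crawler class, a page fetch could be wrapped along the lines of `fetchWithRetry(() -> connect(pageUrl), 3, 500L)`. The attempt count and delay here are placeholder values, not tuned against the site.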
Adding the jar dependencies
First, add the Jsoup and easyExcel jars. This post uses Jsoup 1.9.2 and easyExcel 3.3.2. The snippet also pulls in fastjson and lombok; lombok is declared without a version, which presumably relies on a parent POM managing it. Add the following to the pom file:
<!-- jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>
<!-- easyexcel -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>3.3.2</version>
</dependency>
<!-- fastjson -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.9</version>
</dependency>
<!-- lombok -->
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <scope>provided</scope>
</dependency>
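Implementation
- Create a main method
public static void main(String[] args) {
    JavaJsoupUtil util = new JavaJsoupUtil();
    // Start from the provinces; each level recursively fetches the levels below it
    List<SysCitys> sysAreas = util.getProvinces();
    System.out.println(sysAreas.size());
    // Output file ("全国" means "nationwide") and sheet name ("one-off export result")
    String fileName = "D:/全国" + ".xlsx";
    EasyExcel.write(fileName, SysCitys.class).sheet("一次性导出结果").doWrite(sysAreas);
}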
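- Create a SysCitys class to hold each division record that is fetched
@Data
@AllArgsConstructor
@NoArgsConstructor
public class SysCitys {
    // Division code of this record
    private String addrCode;
    // Division name
    private String addrName;
    // Code of the parent division ("0000" for provinces)
    private String fatherCode;
    // Level: "01" province, "02" city, "03" county, "04" town, "05" village
    private String type;
}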
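For context, the statistical division code is a 12-digit number whose prefix encodes the hierarchy: 2 digits for the province, 2 for the city, 2 for the county, 3 for the town, and 3 for the village. The sketch below is not from the original code; `DivisionCode` and its methods are hypothetical names, and the sample codes are purely illustrative. It shows how a level and a parent code could be derived from the code alone, which might serve as a cross-check on `fatherCode`. Pages that skip a level may not match this derivation exactly.

```java
class DivisionCode {
    // Cumulative prefix lengths for province, city, county, town, village
    private static final int[] PREFIX_LENGTHS = {2, 4, 6, 9, 12};

    /** Right-pads a shorter code (such as the 6-digit province code) with zeros to 12 digits. */
    public static String normalize(String code) {
        if (code == null || !code.matches("\\d{1,12}")) {
            throw new IllegalArgumentException("Expected 1 to 12 digits: " + code);
        }
        StringBuilder sb = new StringBuilder(code);
        while (sb.length() < 12) {
            sb.append('0');
        }
        return sb.toString();
    }

    /** Returns 1 (province) to 5 (village): the first level after which all digits are zero. */
    public static int level(String code) {
        String c = normalize(code);
        for (int i = 0; i < 4; i++) {
            if (c.substring(PREFIX_LENGTHS[i]).chars().allMatch(ch -> ch == '0')) {
                return i + 1;
            }
        }
        return 5;
    }

    /** Returns the 12-digit parent code, or null for a province. */
    public static String parentOf(String code) {
        String c = normalize(code);
        int lvl = level(c);
        if (lvl == 1) {
            return null;
        }
        // Keep the prefix of the level above and zero out the rest
        return normalize(c.substring(0, PREFIX_LENGTHS[lvl - 2]));
    }

    public static void main(String[] args) {
        System.out.println(level("130102001001"));    // village level
        System.out.println(parentOf("130102001001")); // town-level parent
        System.out.println(parentOf("130102000000")); // city-level parent
    }
}
```

Note that the crawler stores province codes with 6 digits while the lower levels use 12, so a `normalize` step like the one above would be needed before comparing them.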
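- Create a JavaJsoupUtil class that uses Jsoup to fetch and parse the data
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import lombok.extern.slf4j.Slf4j;
// Assumption: StringUtils comes from commons-lang3, which is not declared in the pom snippet above
import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

@Slf4j
public class JavaJsoupUtil {
    /**
     * Base URL shared by all pages
     */
    private static final String url = "http://www.stats.gov.cn/sj/tjbz/tjyqhdmhcxhfdm/2021/";

    /**
     * Fetches and parses a page; returns null if the request fails
     */
    private static Document connect(String url) {
        if (url == null || url.isEmpty()) {
            throw new IllegalArgumentException("Invalid url");
        }
        try {
            return Jsoup.connect(url).timeout(1000 * 60).get();
        } catch (IOException e) {
            // Pass the exception as the last argument so the stack trace is logged
            log.error("URL {} does not exist or the connection timed out", url, e);
            return null;
        }
    }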
    /**
     * Fetches all provinces, then recursively everything below them
     * @return flat list of every division that was fetched
     */
    public List<SysCitys> getProvinces() {
        List<SysCitys> sysAreas = new ArrayList<>();
        Document connect = connect(url + "index.html");
        if (null == connect) {
            // Index page unavailable: return an empty list instead of hitting a NullPointerException
            log.error("Failed to load the province index page");
            return sysAreas;
        }
        Elements rowProvince = connect.select("tr.provincetr");
        for (Element provinceElement : rowProvince) {
            Elements select = provinceElement.select("a");
            for (Element province : select) {
                String codUrl = province.attr("href");
                // The province code is the file name with ".html" replaced by "0000"
                String fatherCode = codUrl.replace(".html", "0000");
                String name = province.text();
                SysCitys sysCitys = returnCitys(fatherCode, name, "0000", "01");
                sysAreas.add(sysCitys);
                log.info("++++++++++ Fetching city-level divisions under {} ++++++++++", name);
                String provinceUrl = url + codUrl;
                List<SysCitys> sysAreasList = getCitys(provinceUrl, fatherCode);
                sysAreas.addAll(sysAreasList);
            }
        }
        return sysAreas;
    }
    /**
     * Fetches city-level divisions
     * @param provinceUrl URL of the province page
     * @param parentCode  code of the province being crawled (the parent of its cities)
     * @return the cities and everything below them
     */
    public List<SysCitys> getCitys(String provinceUrl, String parentCode) {
        List<SysCitys> sysAreas = new ArrayList<>();
        Document connect = connect(provinceUrl);
        if (null == connect) {
            // Retry once
            connect = connect(provinceUrl);
        }
        if (null == connect) {
            log.error("Skipping {}: the page could not be loaded", provinceUrl);
            return sysAreas;
        }
        Elements rowCity = connect.select("tr.citytr");
        for (Element cityElement : rowCity) {
            String codUrl = cityElement.select("a").attr("href");
            // The row text is "<code> <name>"; split on the first space only
            String[] split = cityElement.select("td").text().split(" ", 2);
            if (split.length < 2) {
                continue;
            }
            String addrCode = split[0];
            SysCitys sysCitys = returnCitys(addrCode, split[1], parentCode, "02");
            sysAreas.add(sysCitys);
            log.info("---------- Fetching county-level divisions under {} ----------", split[1]);
            /* Optional random delay of up to 1 s between requests, disabled in the original:
            try {
                Thread.sleep((long)(Math.random() * 1000));
            } catch (InterruptedException e) {
                e.printStackTrace();
            }*/
            String cityUrl = url + codUrl;
            List<SysCitys> downAreaCodeList = getCountys(cityUrl, addrCode);
            sysAreas.addAll(downAreaCodeList);
        }
        return sysAreas;
    }
    /**
     * Fetches county-level divisions
     * @param cityUrl    URL of the city page
     * @param parentCode code of the city being crawled (the parent of its counties)
     * @return the counties and everything below them
     */
    public List<SysCitys> getCountys(String cityUrl, String parentCode) {
        List<SysCitys> sysAreas = new ArrayList<>();
        Document connect = connect(cityUrl);
        if (null == connect) {
            // Retry once
            connect = connect(cityUrl);
        }
        if (null == connect) {
            log.error("Skipping {}: the page could not be loaded", cityUrl);
            return sysAreas;
        }
        Elements rowDown = connect.select("tr.countytr");
        for (Element downElement : rowDown) {
            String codUrl = downElement.select("a").attr("href");
            String[] split = downElement.select("td").text().split(" ", 2);
            if (split.length < 2) {
                continue;
            }
            String addrCode = split[0];
            // Rows named "市辖区" (municipal districts) and "金门县" (Kinmen County) are skipped
            if (!"市辖区".equals(split[1]) && !"金门县".equals(split[1])) {
                SysCitys sysCitys = returnCitys(addrCode, split[1], parentCode, "03");
                sysAreas.add(sysCitys);
                if (codUrl.isEmpty()) {
                    // No link on this row, so there is no town page to fetch
                    continue;
                }
                log.info("---------- Fetching town-level divisions under {} ----------", split[1]);
                /* Optional random delay of up to 1 s between requests, disabled in the original:
                try {
                    Thread.sleep((long)(Math.random() * 1000));
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }*/
                // The href is relative to the directory of the current page
                String countyUrl = StringUtils.substringBeforeLast(cityUrl, "/") + "/" + codUrl;
                List<SysCitys> downAreaCodeList = getTowns(countyUrl, addrCode);
                sysAreas.addAll(downAreaCodeList);
            }
        }
        return sysAreas;
    }
    /**
     * Fetches town-level divisions
     * @param countyUrl  URL of the county page
     * @param parentCode code of the county being crawled (the parent of its towns)
     * @return the towns and the villages below them
     */
    public List<SysCitys> getTowns(String countyUrl, String parentCode) {
        List<SysCitys> sysAreas = new ArrayList<>();
        Document connect = connect(countyUrl);
        if (null == connect) {
            // Retry once
            connect = connect(countyUrl);
        }
        if (null == connect) {
            log.error("Skipping {}: the page could not be loaded", countyUrl);
            return sysAreas;
        }
        Elements rowDown = connect.select("tr.towntr");
        for (Element downElement : rowDown) {
            String codUrl = downElement.select("a").attr("href");
            String[] split = downElement.select("td").text().split(" ", 2);
            if (split.length < 2) {
                continue;
            }
            String addrCode = split[0];
            SysCitys sysCitys = returnCitys(addrCode, split[1], parentCode, "04");
            sysAreas.add(sysCitys);
            if (codUrl.isEmpty()) {
                // No link on this row, so there is no village page to fetch
                continue;
            }
            log.info("---------- Fetching village-level divisions under {} ----------", split[1]);
            /* Optional random delay of up to 1 s between requests, disabled in the original:
            try {
                Thread.sleep((long)(Math.random() * 1000));
            } catch (InterruptedException e) {
                e.printStackTrace();
            }*/
            // The href is relative to the directory of the current page
            String townUrl = StringUtils.substringBeforeLast(countyUrl, "/") + "/" + codUrl;
            List<SysCitys> downAreaCodeList = getVillages(townUrl, addrCode);
            sysAreas.addAll(downAreaCodeList);
        }
        return sysAreas;
    }
    /**
     * Fetches village-level divisions
     * @param townUrl    URL of the town page
     * @param parentCode code of the town being crawled (the parent of its villages)
     * @return the villages of this town
     */
    public List<SysCitys> getVillages(String townUrl, String parentCode) {
        List<SysCitys> sysAreas = new ArrayList<>();
        Document connect = connect(townUrl);
        if (null == connect) {
            // Retry once
            connect = connect(townUrl);
        }
        if (null == connect) {
            log.error("Skipping {}: the page could not be loaded", townUrl);
            return sysAreas;
        }
        Elements rowDown = connect.select("tr.villagetr");
        for (Element downElement : rowDown) {
            // Village rows have three cells: code, urban-rural classification code, name
            String[] split = downElement.select("td").text().split(" ", 3);
            if (split.length < 3) {
                continue;
            }
            String addrCode = split[0];
            SysCitys sysCitys = returnCitys(addrCode, split[2], parentCode, "05");
            sysAreas.add(sysCitys);
        }
        return sysAreas;
    }
    /**
     * Builds a division record
     * @param addrCode   division code
     * @param addrName   division name
     * @param fatherCode code of the parent division
     * @param type       level, "01" (province) to "05" (village)
     * @return the populated SysCitys object
     */
    private SysCitys returnCitys(String addrCode, String addrName, String fatherCode, String type) {
        SysCitys sysCitys = new SysCitys();
        sysCitys.setAddrCode(addrCode);
        sysCitys.setAddrName(addrName);
        sysCitys.setFatherCode(fatherCode);
        sysCitys.setType(type);
        return sysCitys;
    }
}
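The class above rebuilds child URLs by cutting the parent URL at its last "/" and appending the relative href. As an alternative sketch, the JDK's `java.net.URI.resolve` performs the same relative-link resolution without the commons-lang dependency. `LinkResolver` is a hypothetical name, and the hrefs in `main` are illustrative values in the shape the code above expects, not values copied from the site.

```java
import java.net.URI;

class LinkResolver {

    /** Resolves a relative href against the URL of the page it was found on. */
    public static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.stats.gov.cn/sj/tjbz/tjyqhdmhcxhfdm/2021/";
        // Each page links to the next level with a path relative to its own directory
        String province = resolve(base + "index.html", "13.html");
        String city = resolve(province, "13/1301.html");
        String county = resolve(city, "01/130102.html");
        System.out.println(province);
        System.out.println(city);
        System.out.println(county);
    }
}
```

Jsoup can also return an absolute link directly through `absUrl("href")` on an element, provided the document was loaded with a base URI, which is the case for pages fetched with `Jsoup.connect(...).get()`.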
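Although I could not obtain the complete five-level data, the three levels of province, city, and county are still obtainable, since that is only a little over 3,000 records. In a later post I will write up the second scraping approach I experimented with.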
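As written, the crawler always descends to the village level, so stopping at the county level needs a depth limit somewhere in the recursion. The sketch below shows one way that could look, under stated assumptions: `DepthLimitedCrawl` and `crawl` are hypothetical names, and the page fetch is replaced by an in-memory lookup so the example runs without network access. It is not the original author's method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class DepthLimitedCrawl {

    /**
     * Collects codes depth-first, starting at the given level
     * (1 = province ... 5 = village) and never descending below maxLevel.
     */
    public static List<String> crawl(String code, int level, int maxLevel,
                                     Function<String, List<String>> childrenOf) {
        List<String> out = new ArrayList<>();
        out.add(code);
        if (level >= maxLevel) {
            return out; // limit reached: the child page is never requested
        }
        for (String child : childrenOf.apply(code)) {
            out.addAll(crawl(child, level + 1, maxLevel, childrenOf));
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the real pages: parent code -> child codes (illustrative values)
        Map<String, List<String>> fake = new HashMap<>();
        fake.put("13", Arrays.asList("1301", "1302"));
        fake.put("1301", Arrays.asList("130102"));
        fake.put("130102", Arrays.asList("130102001"));
        Function<String, List<String>> lookup =
                c -> fake.getOrDefault(c, Collections.emptyList());

        // Stop at level 3 (county): the town "130102001" is never visited
        System.out.println(crawl("13", 1, 3, lookup)); // prints [13, 1301, 130102, 1302]
    }
}
```

In the real crawler the equivalent change would be a level check in `getCountys` before it calls `getTowns`, which also keeps the request count near the three-thousand-record range mentioned above.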