`

heritrix多线程抓取--好使

阅读更多
最近作业中有个需要用Heritrix抓包的任务,不过抓起来,我真的崩溃了。用我的电脑抓了奖金20个小时,抓了50M。都哭了。不过发现那个active threads项最多只有一个,很多时候都是0。偶表示压力很大。。 怎么搞的??
听朋友说,加上网上搜资料,终于搞定,原来Heritrix采用HostnameQueueAssignmentPolicy来进行对URL处理。url队列以hostname为key,所有相同key的url放置在同一个队列里面,也就是说同一个host下面的所有url都放在一个队列里面,当线程获取url时候,会将该队列放置到同步池中,拒绝其他线程访问。觉得说的有道理,嘿嘿。按照如下步骤进行了尝试,果然,好使。

1. 添加一个新类ELFHashQueueAssignmentPolicy.java
package org.archive.crawler.frontier;

import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.framework.CrawlController;
import org.archive.net.UURI;
import org.archive.net.UURIFactory;

public class ELFHashQueueAssignmentPolicy extends QueueAssignmentPolicy {

	private static final Logger logger = Logger
			.getLogger(ELFHashQueueAssignmentPolicy.class.getName());

	private static String DEFAULT_CLASS_KEY = "default...";

	private static final String DNS = "dns";

	@Override
	public String getClassKey(CrawlController controller, CandidateURI cauri) {
		String uri = cauri.getUURI().toString();
		String scheme = cauri.getUURI().getScheme();
		String candidate = null;

		try {
			if (scheme.equals(DNS)) {
				if (cauri.getVia() != null) {
					UURI viaUuri = UURIFactory.getInstance(cauri.flattenVia());
					candidate = viaUuri.getAuthorityMinusUserinfo();
					scheme = viaUuri.getScheme();
				} else {
					candidate = cauri.getUURI().getReferencedHost();
				}
			} else {
				long hash = ELFHash(uri);
				candidate = Long.toString(hash % 100);
			}

			if (candidate == null || candidate.length() == 0) {
				candidate = DEFAULT_CLASS_KEY;
			}
		} catch (URIException e) {
			logger.log(Level.INFO, "unable to extract class key; using default", e);
			candidate = DEFAULT_CLASS_KEY;
		}
		if (scheme != null && scheme.equals(UURIFactory.HTTPS)) {
            if (!candidate.matches(".+:[0-9]+")) {
                candidate += UURIFactory.HTTPS_PORT;
            }
        }

		return candidate.replace(':', '#');
	}

	public static long ELFHash(String str) {
		long hash = 0;
		long x = 0;
		for (int i = 0; i < str.length(); i++) {
			hash = (hash << 4) + str.charAt(i);
			if ((x = hash & 0xF0000000L) != 0) {
				hash ^= (x >> 24);
				hash &= ~x;
			}
		}
		return (hash & 0x7FFFFFFF);
	}
}


2. 修改AbstractFrontier(跟ELFHashQueueAssignmentPolicy.java在同一个包下)
// Read the list of permissible choices from heritrix.properties.
        // Its a list of space- or comma-separated values.
        String queueStr = System.getProperty(AbstractFrontier.class.getName() +
                "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
                ELFHashQueueAssignmentPolicy.class.getName() + " " +//修改之后ELFHash队列分配策略
                IPQueueAssignmentPolicy.class.getName() + " " +
                BucketQueueAssignmentPolicy.class.getName() + " " +
                SurtAuthorityQueueAssignmentPolicy.class.getName() + " " +
                TopmostAssignedSurtQueueAssignmentPolicy.class.getName());
        Pattern p = Pattern.compile("\\s*,\\s*|\\s+");

3. 修改heritrix.properties属性(在conf包下)
#############################################################################
# FRONTIER
#############################################################################

# List here all queue assignment policies you'd have show as a
# queue-assignment-policy choice in AbstractFrontier derived Frontiers
# (e.g. BdbFrontier).
org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
    org.archive.crawler.frontier.ELFHashQueueAssignmentPolicy \
    org.archive.crawler.frontier.IPQueueAssignmentPolicy \
    org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
    org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy \
    org.archive.crawler.frontier.TopmostAssignedSurtQueueAssignmentPolicy
org.archive.crawler.frontier.BdbFrontier.level = INFO


按照如上说明,搞定!!
通过以上配置,有时还是会出问题,虽然不知道为什么,但是还是试了很多方法,解决掉了。
(1) 配置下在Setting里的frontier项中的max retries,改成100(有可能是入口过少)
(2) 将url地址改成ip地址(看过log,有时候会有很多404error,那我直接换成ip地址试下,果然好使,哈哈)

不过有的还是不好使 唉 望有识之士帮忙确定下上面的修改是否能够100%成功!!
  • 大小: 60.9 KB
分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics