Webscraping – Part 2 – Basic XPath HTML queries in JAVA

Introduction

This tutorial builds on the first part. If you haven't read it yet, I recommend doing so, as it contains vital information that I won't repeat in this post.

Note: please download and open the Eclipse project. The run methods can be found in the Scraping02 class.
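
For context, each run method below receives an org.w3c.dom.Document that was built with HtmlCleaner, as covered in part 1. Roughly, the setup looks like this (a sketch from memory; the URL is just a placeholder, not the exact project code):

// Sketch of the part 1 setup: clean the HTML and convert it to a W3C DOM Document.
// Requires the org.htmlcleaner classes; clean() throws IOException and
// createDOM() throws ParserConfigurationException.
HtmlCleaner cleaner = new HtmlCleaner();
TagNode tagNode = cleaner.clean(new URL("http://enira.net/example.html"));
Document document = new DomSerializer(cleaner.getProperties()).createDOM(tagNode);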

Additional queries

Sometimes the HTML structure is so messed up that you need to take matters into your own hands. In run1 I'll show you how to loop through all elements of the images table one by one.

Here 'node' is the first <td> tag returned by the query. Its first child is the whitespace text node in front of the first image, so 'node.getFirstChild().getNextSibling()' is the first <img> tag. To loop through, you need 'next.getNextSibling().getNextSibling()': the immediate sibling of an <img> is the whitespace text node between the tags, so you have to skip it to land on the next <img>.

Example:

private void run1(Document document) {
	System.out.println("run1:");

	XPath xpath = XPathFactory.newInstance().newXPath();
	try {
		// Select the first <td> inside the element with class 'images'.
		Node node = (Node) xpath.evaluate("//*[@class='images']//td",
				document, XPathConstants.NODE);

		// Skip the leading whitespace text node to reach the first <img>.
		Node next = node.getFirstChild().getNextSibling();
		do {
			String image = next.getAttributes().getNamedItem("src")
					.toString();

			// result is: src="image1.jpg", so split on the quotes
			System.out.println(image.split("\"")[1]);

			// Skip the whitespace text node between this <img> and the next,
			// guarding against a missing trailing sibling.
			Node sibling = next.getNextSibling();
			next = sibling == null ? null : sibling.getNextSibling();
		} while (next != null);

	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}

	System.out.println("");
}

Output:

run1:
image1.jpg
image2.jpg
image3.jpg
image4.jpg
image5.jpg
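
If you don't want to count siblings by hand, a more defensive variant (just a sketch, not part of the example project) is to walk every child of the <td> and only act on element nodes named 'img'. Note that getNodeValue() on the attribute also gives you the value directly, without the split:

// Sketch: iterate all children of the <td> and only handle <img> elements,
// so stray whitespace text nodes don't matter.
Node child = node.getFirstChild();
while (child != null) {
	if (child.getNodeType() == Node.ELEMENT_NODE
			&& "img".equalsIgnoreCase(child.getNodeName())) {
		System.out.println(child.getAttributes().getNamedItem("src").getNodeValue());
	}
	child = child.getNextSibling();
}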

This hasn't been covered before, but it is quite useful for web scraping: XPath queries can select attribute values too. This is done using the '@' character, which comes in handy when selecting links from <a href=""> tags. In run2 the query selects the links (the href attributes) from all <a> tags inside the tag with the class 'links'.

private void run2(Document document) {
	System.out.println("run2:");

	XPath xpath = XPathFactory.newInstance().newXPath();
	try {
		// Select the href attribute of every <a> inside the element with class 'links'.
		NodeList nodes = (NodeList) xpath.evaluate(
				"//*[@class='links']//a/@href", document,
				XPathConstants.NODESET);

		for (int i = 0; i < nodes.getLength(); i++) {
			// Each node is the attribute itself; getTextContent() returns its value.
			System.out.println(nodes.item(i).getTextContent());
		}
	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}

	System.out.println("");
}

Output:

run2:
http://enira.net/links/link1.htm
http://enira.net/links/link2.htm
http://enira.net/links/link3.htm
http://enira.net/links/link4.htm
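
The same '@' trick also works when you only need a single value: evaluating with XPathConstants.STRING returns the attribute as a plain String. A quick sketch (the [1] predicate picking the first link is only for illustration; this throws XPathExpressionException like the other examples):

// Sketch: grab one attribute value directly as a String.
// The [1] predicate simply picks the first href in the 'links' element.
String firstLink = (String) xpath.evaluate(
		"(//*[@class='links']//a/@href)[1]", document, XPathConstants.STRING);
System.out.println(firstLink);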

In run3 I've combined run2 with run1. Here you can also see that HtmlCleaner corrects tags that aren't well-formed (in this case the <img> tags). The XPath query reads: select the element with the class 'images', find every <td> inside it at any level/depth, take its <img> children, and get their src attributes.

private void run3(Document document) {
	System.out.println("run3:");

	XPath xpath = XPathFactory.newInstance().newXPath();
	try {
		// Select the src attribute of every <img> that sits in a <td>
		// anywhere under the element with class 'images'.
		NodeList nodes = (NodeList) xpath.evaluate(
				"//*[@class='images']//td/img/@src", document,
				XPathConstants.NODESET);

		for (int i = 0; i < nodes.getLength(); i++) {
			System.out.println(nodes.item(i).getTextContent());
		}
	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}

	System.out.println("");
}

Output:

run3:
image1.jpg
image2.jpg
image3.jpg
image4.jpg
image5.jpg

So that's about it. You should now be able to scrape the web quite easily; I think I've covered enough to get you started scraping pages with XPath.

Downloads:

Example project (Eclipse): Scraping02.zip (132 KB)
