Aug 14 2013 | By enira | Filed under: Java, Webscraping
Introduction
So occasionally I want to scrape a web page for links or images. Being a programmer I hate such tedious tasks, and I always end up writing a script. In the past I always used a combination of substring(), indexOf() or some other string functions. However, once an HTML page has been cleaned up into well-formed markup, it can be treated like an XML document! This enables a much easier method to search content: XPath.
This little tutorial covers XPath queries on a plain HTML page in Java. It is adapted from the tutorial that can be found at: http://manual.calibre-ebook.com/xpath.html.
Setup
The Java library that I will use in this tutorial is HtmlCleaner. HtmlCleaner is an open-source HTML parser written in Java that cleans up badly written HTML code. You can download it at: http://htmlcleaner.sourceforge.net/download.php.
To add the library in Eclipse, right-click the project and go to ‘Properties’. Go to the tab ‘Libraries’ (under ‘Java Build Path’) and press ‘Add JARs’ to select the HtmlCleaner JAR.
I also provided a test file. In the project you can find it in the ‘src/resources’ folder. The function readDocument() will read this file and create a usable Document object.
private Document readDocument() {
    String content;
    try {
        content = FileUtils
                .readLargeTextFileUTF8("src/resources/index.html");
    } catch (IOException e) {
        // Without the file there is nothing to parse, so report it and stop.
        e.printStackTrace();
        return null;
    }
    // HtmlCleaner repairs the markup and returns a tree of TagNode objects.
    TagNode tagNode = new HtmlCleaner().clean(content);
    try {
        // Convert the cleaned tree to a standard org.w3c.dom.Document.
        return new DomSerializer(new CleanerProperties()).createDOM(tagNode);
    } catch (ParserConfigurationException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return null;
}
This function will be used by all the run examples.
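If you want to try the XPath calls without HtmlCleaner or the test file, the following is a minimal sketch that uses only the JDK. It assumes the markup is already well-formed, since the JDK parser does not repair broken HTML. The class name, the SAMPLE markup and the select() helper are made up for illustration and are not part of the example project.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class JdkOnlyXPathSketch {
    // Hypothetical stand-in for index.html; the real test file is not shown in this post.
    static final String SAMPLE =
            "<html><body>"
            + "<h1>A very short ebook</h1>"
            + "<h2>Chapter One</h2><p>Some text.</p>"
            + "<h2>Chapter Two</h2><p>More text.</p>"
            + "</body></html>";

    // Parses well-formed markup with the JDK parser and returns the text of every node matched by the expression.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE, "//h2"));
    }
}
```

For real-world pages the HtmlCleaner route stays the safer choice, because most HTML found on the web will not get past a strict XML parser.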
Selecting by tagname
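This first example selects all h2 tags in the document. Please note: ‘//’ matches elements at any depth below the starting node, not only those one level from the root; a single ‘/’ is what steps exactly one level down. By default HtmlCleaner wraps the cleaned content in html, head and body elements, so the resulting document has html as its root.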
private void run1(Document document) {
    System.out.println("run1:");
    XPath xpath = XPathFactory.newInstance().newXPath();
    try {
        NodeList nodes = (NodeList) xpath.evaluate("//h2", document,
                XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getTextContent());
        }
    } catch (XPathExpressionException e) {
        e.printStackTrace();
    }
    System.out.println("");
}
Output:
run1:
Chapter One
Chapter Two
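To see how ‘//’ differs from a path built from single ‘/’ steps, here is a minimal JDK-only sketch. The markup, the class name and the select() helper are assumptions made for this illustration; the snippet is parsed with the JDK parser instead of HtmlCleaner, so it has to be well-formed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DescendantAxisSketch {
    // Hypothetical markup: one h2 directly under body, one nested inside a div.
    static final String SAMPLE =
            "<html><body>"
            + "<h2>Top level</h2>"
            + "<div><h2>Nested</h2></div>"
            + "</body></html>";

    // Returns the text of every node matched by the expression.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // '//' looks at every depth, so both headings match.
        System.out.println(select(SAMPLE, "//h2"));
        // Single '/' steps walk one level at a time, so only the direct child of body matches.
        System.out.println(select(SAMPLE, "/html/body/h2"));
    }
}
```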
The next example finds all ‘div’ elements and shows the p tags directly inside them. Note: because of the leading ‘//’ the div may sit at any depth, but the p must be a direct child of the div!
private void run2(Document document) {
    System.out.println("run2:");
    XPath xpath = XPathFactory.newInstance().newXPath();
    try {
        NodeList nodes = (NodeList) xpath.evaluate("//div/p", document,
                XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getTextContent());
        }
    } catch (XPathExpressionException e) {
        e.printStackTrace();
    }
    System.out.println("");
}
Output:
run2:
A very short ebook to demonstrate the use of XPath.
The next example gives the same result as the second run. The extra ‘*’ step only requires the div to have some parent element, which holds for every div except one that is the root element itself, so ‘//*/div/p’ and ‘//div/p’ select the same nodes in an HTML page.
private void run3(Document document) {
    System.out.println("run3:");
    XPath xpath = XPathFactory.newInstance().newXPath();
    try {
        NodeList nodes = (NodeList) xpath.evaluate("//*/div/p", document,
                XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getTextContent());
        }
    } catch (XPathExpressionException e) {
        e.printStackTrace();
    }
    System.out.println("");
}
Output:
run3:
A very short ebook to demonstrate the use of XPath.
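The queries from run2 and run3 can be compared on nested markup with a small JDK-only sketch. The nested div sample, the class name and the select() helper are assumptions for this illustration and do not come from the example project.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NestedDivSketch {
    // Hypothetical markup: a div under body that contains a p and a second, nested div.
    static final String SAMPLE =
            "<html><body>"
            + "<div><p>outer</p><div><p>inner</p></div></div>"
            + "</body></html>";

    // Returns the text of every node matched by the expression.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Both forms reach the p inside the nested div as well.
        System.out.println(select(SAMPLE, "//div/p"));
        System.out.println(select(SAMPLE, "//*/div/p"));
        // Restricting the div to the top level needs an explicit path from the root.
        System.out.println(select(SAMPLE, "/html/body/div/p"));
    }
}
```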
The next example selects based on the name of the element. In this case XPath will select all <h1> and <h2> tags.
private void run4(Document document) {
    System.out.println("run4:");
    XPath xpath = XPathFactory.newInstance().newXPath();
    try {
        NodeList nodes = (NodeList) xpath.evaluate(
                "//*[name()='h1' or name()='h2']", document,
                XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getTextContent());
        }
    } catch (XPathExpressionException e) {
        e.printStackTrace();
    }
    System.out.println("");
}
Output:
run4:
A very short ebook
Chapter One
Chapter Two
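As an alternative to the name() predicate, XPath also has a union operator ‘|’ that combines the results of two paths. The sketch below compares both forms on a made-up snippet using only the JDK; the class name, the SAMPLE markup and the select() helper are assumptions for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UnionSketch {
    // Hypothetical markup with one h1 and two h2 headings.
    static final String SAMPLE =
            "<html><body>"
            + "<h1>A very short ebook</h1><p>Intro.</p>"
            + "<h2>Chapter One</h2><h2>Chapter Two</h2>"
            + "</body></html>";

    // Returns the text of every node matched by the expression.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Predicate on the element name, as in run4.
        System.out.println(select(SAMPLE, "//*[name()='h1' or name()='h2']"));
        // Union of two separate paths; it matches the same three headings.
        System.out.println(select(SAMPLE, "//h1 | //h2"));
    }
}
```

The union form tends to read more easily when the two paths differ in more than just the tag name.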
Selecting by attributes
In XPath it is also possible to select based on attributes. The next example selects the content of all tags that have a style attribute, whatever its value.
private void run5(Document document) {
    System.out.println("run5:");
    XPath xpath = XPathFactory.newInstance().newXPath();
    try {
        NodeList nodes = (NodeList) xpath.evaluate("//*[@style]", document,
                XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getTextContent());
        }
    } catch (XPathExpressionException e) {
        e.printStackTrace();
    }
    System.out.println("");
}
Output:
run5:
Written by Kovid Goyal
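For scraping links or images you usually want the attribute value itself (an href or src) rather than the text of the element. A path can end in ‘@name’ to select the attribute nodes directly. The following is a minimal JDK-only sketch; the style value, the class name and the select() helper are assumptions for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AttributeValueSketch {
    // Hypothetical markup: one paragraph with a style attribute, one without.
    static final String SAMPLE =
            "<html><body>"
            + "<p style='text-align:right'>Written by Kovid Goyal</p>"
            + "<p>No style here.</p>"
            + "</body></html>";

    // Returns the text content of every matched node; for attribute nodes that is the attribute value.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Elements that carry a style attribute, as in run5.
        System.out.println(select(SAMPLE, "//*[@style]"));
        // The style attributes themselves.
        System.out.println(select(SAMPLE, "//*[@style]/@style"));
    }
}
```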
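The next example selects all elements of the class ‘chapter’. This is done with an attribute selector like in run5, but on the class attribute and restricted to the value ‘chapter’. Note: the comparison is against the whole attribute value, so an element with several classes, such as ‘chapter first’, would not match.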
private void run6(Document document) {
    System.out.println("run6:");
    XPath xpath = XPathFactory.newInstance().newXPath();
    try {
        NodeList nodes = (NodeList) xpath.evaluate("//*[@class='chapter']",
                document, XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getTextContent());
        }
    } catch (XPathExpressionException e) {
        e.printStackTrace();
    }
    System.out.println("");
}
Output:
run6:
Chapter One
Chapter Two
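Real pages often put several class names in one attribute, and then an exact comparison with ‘chapter’ misses them. A common XPath 1.0 workaround pads the attribute with spaces and searches for the padded class name. The sketch below shows three variants side by side using only the JDK; the markup, the class name and the select() helper are assumptions for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ClassTokenSketch {
    // Hypothetical markup: a single class, two classes, and a class that merely contains the word.
    static final String SAMPLE =
            "<html><body>"
            + "<h2 class='chapter'>Chapter One</h2>"
            + "<h2 class='chapter appendix'>Appendix</h2>"
            + "<h2 class='subchapter'>Aside</h2>"
            + "</body></html>";

    // Returns the text of every node matched by the expression.
    static List<String> select(String markup, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Exact match on the whole attribute value, as in run6.
        System.out.println(select(SAMPLE, "//*[@class='chapter']"));
        // Substring match; this also catches 'subchapter'.
        System.out.println(select(SAMPLE, "//*[contains(@class, 'chapter')]"));
        // Match 'chapter' as one whole word within the space-separated class list.
        System.out.println(select(SAMPLE,
                "//*[contains(concat(' ', normalize-space(@class), ' '), ' chapter ')]"));
    }
}
```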
Now let’s get a little bit more advanced: the next example selects the <h1> tag that has the class ‘bookTitle’. The XPath query reads: anywhere in the document, select all h1 elements with a class attribute that equals ‘bookTitle’. Because the result is requested as XPathConstants.STRING, only the text of the first matching element is returned.
private void run7(Document document) {
    System.out.println("run7:");
    XPath xpath = XPathFactory.newInstance().newXPath();
    String str;
    try {
        str = (String) xpath.evaluate("//h1[@class='bookTitle']", document,
                XPathConstants.STRING);
        System.out.println(str);
    } catch (XPathExpressionException e) {
        e.printStackTrace();
    }
    System.out.println("");
}
Output:
Selecting by tag content
XPath is also able to select content based on text containing a certain value. run8 shows how to select all <h2> tags containing the text string ‘One’. Note: XPath is case sensitive!
private void run8(Document document) {
    System.out.println("run8:");
    XPath xpath = XPathFactory.newInstance().newXPath();
    try {
        NodeList nodes = (NodeList) xpath.evaluate(
                "//h2[contains(., 'One')]", document,
                XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            System.out.println(nodes.item(i).getTextContent());
        }
    } catch (XPathExpressionException e) {
        e.printStackTrace();
    }
    System.out.println("");
}
Output:
Downloads:
Example project (Eclipse): Scraping01.zip (132 KB)