enira.net

We are all living in a technological wasteland.


NAS – Part 1: The hardware

Introduction

So in the past I’ve always assembled NAS devices from old PCs. My previous NAS was built around an old Intel Q6600 processor, which has a thermal design power of 105 Watt. Ouch! Time to upgrade to something much more power efficient. Enter http://ark.intel.com/products/77987: this little Atom C2750 is an octa-core with a TDP of only 20 Watt.

Hardware

- ASRock C2750D4I
- Lian Li PC-Q25B case
- 2 x 2GB Mushkin DDR3 RAM
- 4 x 3TB disk drives
- 2 x 1TB disk drives
- 1 x 60 GB SSD disk drive
- 1 x 1.5 TB disk drive

Setup

My data disks will be set up according to the following specification (a rough mdadm sketch follows the list):
- one RAID5 (three 3TB drives) for semi-critical data;
- one RAID1 (two 1TB drives) for critical files that I never want to lose (backups);
- one 3TB disk used for files that may be lost (virtual machines and such).
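
As a rough sketch, creating those arrays with mdadm would look something like this (the device names are examples; check yours with lsblk first):

# RAID5 over the three 3TB drives, RAID1 over the two 1TB drives
sudo mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
sudo mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sde /dev/sdf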

The OS will be installed on the SSD to provide a fast and stable system. For now I’ve also added an old 1.5TB disk drive to the system; in the future I want to replace it with another 3TB drive once the RAID5 needs extra capacity.

All RAID configurations will be handled by software RAID. There is no sane reason to spend serious money on a hardware RAID controller for a home NAS. My system performs at least 83MB/s using software RAID (picture below, taken when the system was done, copying an external disk to the NAS).

[Photo: DSC_0166]

Building the system

I don’t really want to explain this part; it’s pretty straightforward. If you can’t read a manual or build your own system, I recommend closing this page and buying a Synology instead. But for those interested: here are some juicy pictures of me building my NAS.

Unpacking the little motherboard (yes, a passively cooled octa-core).

[Photos: DSC_0149, DSC_0150]

Clearing out my old NAS system (the Q6600 PC).

[Photos: DSC_0152, DSC_0151]

The final result before closing the case. I simply love the 5 hot swappable bays.

[Photos: DSC_0162, DSC_0160]

Booting the NAS with my little 10″ debug screen. One of the best purchases I’ve ever made.

[Photo: DSC_0158]

The next part will cover all the services and software.

 


Amazon ephemeral: swap on boot.

Introduction

With AWS you can add ephemeral storage to an EC2 instance. The downside of this storage is that it’s gone once you stop or reboot the machine, which makes it perfectly suitable for swap space. However, you can’t just add this swap to your /etc/fstab file: it would block the boot of your EC2 instance, since ephemeral storage always resets and any swap partitions on it will have vanished.

My way of coping with this is creating a little service that creates the swap at boot time.

Code

Let’s create a service named ‘swapon’.

sudo nano /etc/init.d/swapon

And add the following content:

#!/bin/sh
# chkconfig: 2345 95 20
# description: Adding swap to ephemeral0
# processname: swap

# Create an 8GB swap file on the ephemeral volume and enable it.
dd if=/dev/zero of=/media/ephemeral0/swap bs=1024 count=8M
mkswap /media/ephemeral0/swap
chown root:root /media/ephemeral0/swap
chmod 600 /media/ephemeral0/swap
swapon /media/ephemeral0/swap

This service will generate an 8GB swap file (hence count=8M blocks of 1024 bytes each). If you need less or more, change the ‘count’ parameter.

Now the last step: assign execute rights and enable this script/service at boot time.

sudo chmod +x /etc/init.d/swapon
sudo chkconfig --level 345 swapon on

Now your swap will be created every time you boot (or reboot) your EC2 instance. The only drawback of this method is that it runs in the background, so your swap won’t be available immediately. On the upside, creating the swap won’t block the machine’s boot process.

Example: on my m1.small instance it takes about five minutes to create the 8GB swap file.
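
To verify that the swap has come online once the service has run, you can check with:

swapon -s
free -m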


Where’s my ephemeral storage?

Introduction

By default Amazon Web Services doesn’t assign ephemeral storage to a new instance. I don’t know whether this is done to favor their EBS products, but once you’ve created a machine it’s a real pain in the ass to enable it. At the time of writing there is no way to add ephemeral storage to an existing EC2 instance. The only way I found is to migrate the instance to a new one.

Procedure

Step 1: Shut down the EC2 instance you wish to migrate.
Step 2: Write down the associated and attached EBS volumes from the EC2 instance.
Step 3: In the EC2 dashboard, go to the ‘Volumes’ section of the ‘Elastic Block Store’ menu, from there right click on all attached volumes and detach these.
Step 4: Go to the instances menu, right click your EC2 instance and select ‘Launch more like this’. It’s also possible to create one from scratch, but with the ‘Launch more like this’ option the same configuration comes pre-selected.

Now, from the ‘Storage’ options, select the ‘Edit Storage’ option.
Press ‘Add new volume’ and select ‘Instance Store 0’. This is your ephemeral storage. How much you can assign depends on the instance type: an m1.medium instance, for example, gets one 410GB disk drive.

More information at: Amazon EC2 Instance Store – Amazon Elastic Compute Cloud.
Step 5: Once your EC2 instance is created, stop it.
Step 6: Once your EC2 instance has shut down, move to the ‘Volumes’ section again and detach the newly created volume from the new EC2 instance.
Step 7: Attach the old volumes from the instance you’re migrating to your newly created one, using the original device mappings. (Don’t worry, /dev/sda1 is OK to assign.)
Step 8: Delete the volume created by the ‘Launch more like this’ option and start your new EC2 instance. You can also delete the old EC2 instance; we won’t need it anymore.
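
For those who prefer the command line over the dashboard, roughly the same stop/detach/attach dance can be done with the AWS CLI. The volume and instance IDs below are placeholders:

aws ec2 stop-instances --instance-ids i-oldinstance
aws ec2 detach-volume --volume-id vol-mydata
aws ec2 attach-volume --volume-id vol-mydata --instance-id i-newinstance --device /dev/sda1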

The only downside of this procedure is that the new instance gets a new instance ID. Hopefully Amazon adds a way to do this after an instance has been created.


Raspberry Pi – Backup script

So I’ve made a little script for the Raspberry Pi to back up the device on the fly.

Note: it probably isn’t the best way to back up a device while it is running, but sometimes it might be the only backup you get, as the Pi is notorious for damaging SD cards: http://raspberrypi.stackexchange.com/questions/7040/is-my-raspberry-pi-permanently-damaging-sd-cards

Here’s the script:

if [ "$#" -ne 5 ]; then
        me=`basename $0`
        echo "Usage: $me     "
        exit
fi
 
if [ ! -d "/mnt" ]; then
        echo "No /mnt folder found"
        exit
fi
 
if [ ! -d "/mnt/network/backup" ]; then
        echo "No /mnt/network/backup folder, creating one..."
        mkdir -p /mnt/network/backup 2> /dev/null
fi
 
if [ ! -d "/mnt/network/backup" ]; then
        echo "Couldn't create folder. Are you root?"
        exit
fi
 
shares=$(df -h | grep -i $2)
 
if [ "$shares" != "" ]; then
        echo "Share already mounted."
        echo "Unmounting samba shares..."
        timeout=0
 
        while [ "$shares" != "" ]
        do
                umount -a -t cifs -l
 
                shares=$(df -h | grep -i $2)
                sleep 1
                timeout=`expr $timeout + 1`
 
                if [ "$timeout" -gt 60 ]; then
                        echo "ERROR: Unmount time-out"
                        exit
                fi
        done
fi
 
echo "Mounting $3@$2..."
 
mount -t cifs -o user=$3,password=$4 $2 /mnt/network/backup
 
echo "Searching key file '$5'..."
timeout=0
while [ $timeout -lt 60 ]; do
        FILE="/mnt/network/backup/"$5
 
        timeout=`expr $timeout + 1`
        if [ -f $FILE ]; then
                timeout=100
        fi
 
        if [ $timeout -eq 60 ]; then
                echo "ERROR: Timeout in finding key file: $FILE"
                exit
        fi
        sleep 1
done
 
echo "Backup directory authorized."
 
backup=`cat /etc/hostname`-`date +%s`.dd
 
echo "Creating backup '$backup'..."
 
dd if=$1 of=/mnt/network/backup/$backup bs=4096 conv=notrunc,noerror
 
echo "Unmounting network drives..."
 
umount -a -t cifs -l
 
echo "Done!"

Usage example:

sudo /home/pi/backup.sh /dev/mmcblk0 //ip_remote_server/varia/backup username password backup.key
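
To restore such an image later, the same dd trick works in reverse. A hypothetical example, run from another machine with the SD card attached (the file and device names are placeholders; double check the output device before writing to it!):

sudo dd if=raspberrypi-1385000000.dd of=/dev/sdX bs=4096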

This script can be scheduled as a cron job every night or week with ‘crontab -e’.
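
For example, an entry in root’s crontab (the script needs root) to run every Sunday at 3 AM, with placeholder credentials:

0 3 * * 0 /home/pi/backup.sh /dev/mmcblk0 //ip_remote_server/varia/backup username password backup.key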


Disaster on my software RAID.

Disaster struck

I’ve always been a fan of software RAID for cheap data redundancy, and I simply can’t grasp why some people still swear by hardware RAID at home and stubbornly want nothing to do with software RAID. I really like argument 2 of this article: http://augmentedtrader.wordpress.com/2012/05/13/10-things-raid/ because it’s so damn true.

In the past I’ve encountered quite a few harsh situations for my software RAID:

  • Motherboard + CPU upgrade
  • Twice a disk failure
  • Once a disconnected disk
  • Removed USB drives from external casing and connected them straight to the motherboard
  • Expanded a RAID5 from 3 to 4, then 5, and finally 6 disks.

Every time, my little RAID kept going and going. Five years, ever since Ubuntu 8.04 LTS! Today, however, disaster struck. And it struck badly.

In the past I’ve had my fair share of harsh usage (see above), but this time was different. I received a warning that one of my disks had failed and there was no hot spare available for the rebuild.

Stupid as I was, I SSH’ed in from my phone and shut down the RAID to limit any further damage.

Big mistake! There were several things happening at the same time:

a) One disk had died, with no spare, in a RAID5.
b) The BIOS battery seemed to be dead; all these years of uptime had taken their toll on it. So when I booted the system to troubleshoot the RAID, I saw three superblocks dated somewhere in March 1970 and one dated the 28th of November 2013. Ouch!
c) A small hiccup seemed to have occurred on the server: the superblock of one of the drives was simply lost. Just blank, empty. It might have been caused by me interrupting the power when I got home. (The NAS runs headless, so I don’t really know what is going on, and I often reboot the server.)

So yeah, the superblocks kind of screwed me over. I tried to rebuild the array like I normally do and get the disks running again, but all data was lost: there wasn’t enough precise superblock information left to reassemble correctly, and a sync destroyed my data. Such a shame.

What did I learn?

Keep your RAID alive at all costs and add a hot spare as fast as possible! Once your system goes down, there is no way to know what exactly will happen. Also, if possible, keep a backup of your superblock information.
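
A quick sketch of how you could keep such a record with mdadm (device names are examples for a six-disk array):

sudo mdadm --detail --scan > /root/mdadm-layout.txt
sudo mdadm --examine /dev/sd[b-g]1 > /root/mdadm-superblocks.txt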

So the future?

My next build will use a RAID5 of three disks for non-critical data and a RAID1 of two disks for critical storage. I’ll reuse two of the old 1TB disks for the mirror (my old RAID5 was built with 6 x 1TB).

In the meantime, here are a few of my hints for tracking down the failed disk.
It is really easy if you have the right tools. I am using an old controller salvaged from an external disk enclosure: plug the drive into a nearby power source and debug on the fly over USB with a laptop.
Pictures below:

[Photos: DSC_0131, DSC_0127]

Once connected run:

sudo badblocks -sv /dev/sdi

Note: this can take a very long time

enira@enira-MS-7740:~$ sudo badblocks -sv /dev/sdi
Checking blocks 0 to 976762495
Checking for bad blocks (read-only test): 95.00% done, 7:42:14 elapsed. (0/0/0 errors)

Note: yes, I know it’s a read-only test. But there is no need for a read/write test here. Why? Because it’s already quite clear: mdadm already stated that the disk is broken. I don’t need to hunt for write errors; I just need to identify which shitty disk failed.
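
One more hint: to match the logical device to the physical disk, the serial number helps (assuming hdparm is installed; it should match the sticker on the drive):

sudo hdparm -I /dev/sdi | grep -i serial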


Slitaz project – Expansion – Part 2 – FTP Server

This FTP machine will get the IP 192.168.1.241. Still remember the IP script? Let’s call it and reboot the server.

/home/base/ip.sh ftp 192.168.1 241 1
reboot

Now install the server and clean the cache.

tazpkg get-install pure-ftpd
tazpkg clean-cache

Once pure-ftpd is installed you will need to create the proper accounts for it. In this case I will use ftpgroup and ftpuser.

addgroup -S ftpgroup
adduser -SH ftpuser -G ftpgroup

Tip: to verify that the accounts exist, cat the passwd and group files. If they mention ‘ftpuser’ and ‘ftpgroup’, everything is fine.

cat /etc/passwd 
cat /etc/group

Once the Unix user is made, create the virtual user’s directory (contributor in this case) in ‘/home/ftpusers’ and make ftpuser the owner of the contributor folder.

mkdir /home/ftpusers
mkdir /home/ftpusers/contributor
chown ftpuser -R /home/ftpusers/contributor

Now add the new contributor user to pure-ftpd and map this FTP user to the Unix user ftpuser:

pure-pw useradd contributor -u ftpuser -d /home/ftpusers/contributor
pure-pw mkdb

To verify that the user exists, run:

pure-pw show contributor

You can see that this user operates under the ftpuser/ftpgroup.

root@base:/# pure-pw show contributor
 
Login              : contributor
Password           : $1$yljZ5iF0$RSfbAJ4ZDtyAQtjOYSKwg.
UID                : 100 (ftpuser)
GID                : 101 (ftpgroup)
Directory          : /home/ftpusers/contributor/./
Full name          :
Download bandwidth : 0 Kb (unlimited)
Upload   bandwidth : 0 Kb (unlimited)
Max files          : 0 (unlimited)
Max size           : 0 Mb (unlimited)
Ratio              : 0:0 (unlimited:unlimited)
Allowed local  IPs :
Denied  local  IPs :
Allowed client IPs :
Denied  client IPs :
Time restrictions  : 0000-0000 (unlimited)
Max sim sessions   : 0 (unlimited)

Now that the user is done, make the service start at boot. First add it to the list of daemons loaded during startup; this is done in the rcS.conf file.

nano /etc/rcS.conf
RUN_DAEMONS="dbus hald slim firewall dropbear lighttpd pure-ftpd"

Once this is done, edit the daemons file to add the options that will be passed to the service.

nano /etc/daemons.conf
# Pure FTPd
PUREFTPD_OPTIONS="-4 -H -A -B -j -l puredb:/etc/pureftpd.pdb"
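
For reference, my understanding of those flags (check the pure-ftpd documentation to be sure):

# -4  listen on IPv4 only
# -H  don't resolve client host names
# -A  chroot every user into their home directory
# -B  run in the background (daemonize)
# -j  create the user's home directory on first login if it is missing
# -l puredb:/etc/pureftpd.pdb  authenticate against the puredb user database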

Once this is done, edit the /etc/init.d/pure-ftpd file. This is the service start file. Apparently there is a fuckup in this file, as the pure-ftpd service isn't like the others: it doesn't read its daemon options from daemons.conf, everything is just hardcoded in the startup script. And I don't like that >:(. So let's change it.

nano /etc/init.d/pure-ftpd

Find this line:

OPTIONS="-4 -H -A -B"

and replace it with:

OPTIONS=$PUREFTPD_OPTIONS
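
By the way, instead of rebooting you can probably also just restart the service (assuming the init script supports stop/start):

/etc/init.d/pure-ftpd stop
/etc/init.d/pure-ftpd start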

Voilà, the FTP server is done. Now reboot the machine (or just restart the service as shown above) and check that it is running:

ps | grep -i pure-ftpd
root@base:~# ps | grep -i pure-ftpd
 1392 root       0:00 pure-ftpd (SERVER)
 1436 root       0:00 grep -i pure-ftpd

If everything is running you should be able to connect to ftp://contributor@192.168.1.241/ with Windows Explorer (or FileZilla, or whatever client you are using).

That's it! Below is the download link for this small tutorial:
ftp.7z (12,4 MB)


Shrinking VMWare Images

Sometimes VMware images become quite large. This is completely normal as files are added and deleted. Usually it isn’t much of a problem on machines which run constantly, but when you need to transfer disk images through the net, or archive them, every byte counts.

As an experiment I’ve shrunk the disk images from the Slitaz project.

Shrinking

First boot the OS you wish to shrink and zero-fill the free disk space.

# Fill the free space with zeroes; dd stops with "No space left on device", which is expected.
dd if=/dev/zero of=/home/base/wipe.file
rm /home/base/wipe.file

If additional storage locations are mounted from extra disks, create a zero file there too. The Gluster nodes use ‘/mnt/data/’ and the MySQL machine uses ‘/var/lib/mysql/’.

Then shut down the machine and open a command prompt on your host system.

cd "c:\Program Files (x86)\VMware\VMware Workstation"

Once you are in the folder, use vmware-vdiskmanager to shrink the vmdk files:

vmware-vdiskmanager.exe -k "drag vmdk file here"
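
For example (the path is a placeholder for wherever your virtual machine lives):

vmware-vdiskmanager.exe -k "C:\VMs\mysql\mysql.vmdk"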

Results

mysql.7z             old: 45.6 MB         new: 20.7 MB        saved: 55%
glusternodes.7z      old: 254 MB          new: 165 MB         saved: 35%
glusterclient.7z     old: 78.7 MB         new: 50.9 MB        saved: 35%
webnodes.7z          old: 253 MB          new: 159 MB         saved: 37%
loadbalancers.7z     old: 124 MB          new: 67.1 MB        saved: 46%

The results are clearly astonishing: the smallest saving is 35% while the biggest is 55%. In total I’ve reduced the file size on my webhost by 292.6 MB! That’s more than enough.

 


Bitcoin mining – Xubuntu 32-bit, cgminer and icarus support.

Introduction

So recently I’ve started ASIC bitcoin mining: I’ve ordered some Block Erupters (Icarus protocol) and I have a spare ASUS EEE PC lying around with 3 USB ports. Ideal for mining: low power, small and silent. However, it’s a 32-bit device and cgminer isn’t built for this :(.

Time to take matters into our own hands.

Installing Xubuntu on the ASUS EEE-PC

Xubuntu is a clear choice for this EEE PC. Light, versatile, lots of packages! You can get it here: http://xubuntu.org/

The install is pretty straightforward, so I won’t cover it here.

Installing openssh

I like my installations to come with an OpenSSH server, just to see what’s going on. (For a web based stats page with phpSysInfo, see my Slitaz tutorials.)

sudo apt-get update
sudo apt-get install openssh-server

Compiling and installing cgminer

The problem with cgminer is that I can’t find it precompiled for 32-bit, and the libusb it uses is pretty non-standard, so grabbing the libusb found in Ubuntu won’t work.

First make sure you have all the latest packages installed, and grab the dependencies to compile cgminer.

sudo apt-get upgrade
 
sudo apt-get install libcurl4-openssl-dev ncurses-dev libudev-dev

Now, let’s download a version of cgminer and an old libusb which works, and unpack the files.

cd /home/enira/
 
wget https://github.com/kanoi/cgminer-binaries/blob/master/libusb-1.0.16-rc10.tar.bz2?raw=true -O libusb-1.0.16-rc10.tar.bz2
wget http://ck.kolivas.org/apps/cgminer/3.4/cgminer-3.4.0.tar.bz2
 
tar -xvjf libusb-1.0.16-rc10.tar.bz2
tar -xvjf cgminer-3.4.0.tar.bz2

Now copy libusb to the cgminer folder and compile it.

mkdir ./cgminer-3.4.0/libusb
cp -R libusb-1.0.16-rc10 ./cgminer-3.4.0/libusb
 
cd /home/enira/cgminer-3.4.0/libusb/libusb-1.0.16-rc10/
 
./configure
make

Then compile cgminer.

cd /home/enira/cgminer-3.4.0/
 
LIBUSB_CFLAGS="-I./libusb/libusb-1.0.16-rc10/libusb" LIBUSB_LIBS="./libusb/libusb-1.0.16-rc10/libusb/.libs/libusb-1.0.a -ludev -lrt" CFLAGS="-g -W -Wall" ./configure --enable-bflsc --enable-icarus --enable-bitforce --enable-modminer --enable-ztex --enable-avalon

The output should look like this:

------------------------------------------------------------------------
cgminer 3.4.0
------------------------------------------------------------------------
Configuration Options Summary:
 
curses.TUI...........: FOUND: -lncurses
OpenCL...............: NOT FOUND. GPU mining support DISABLED
scrypt...............: Disabled (needs OpenCL)
ADL..................: SDK NOT found, GPU monitoring support DISABLED
 
Avalon.ASICs.........: Enabled
BFL.ASICs............: Enabled
BitForce.FPGAs.......: Enabled
Icarus.FPGAs.........: Enabled
ModMiner.FPGAs.......: Enabled
Ztex.FPGAs...........: Enabled
 
Compilation............: make (or gmake)
CPPFLAGS.............:
CFLAGS...............: -g -W -Wall -I./libusb/libusb-1.0.16-rc10/libusb
LDFLAGS..............: -lpthread
LDADD................: -lcurl compat/jansson/libjansson.a -lpthread -lm ./libusb/libusb-1.0.16-rc10/libusb/.libs/libusb-1.0.a -ludev -lrt
 
Installation...........: make install (as root if needed, with 'su' or 'sudo')
prefix...............: /usr/local

 

And run make:

make

Once this is done, create your miner script. I created mine in my user directory.

nano /home/enira/mine_bitcoins.sh
#!/bin/sh
cd /home/enira/cgminer-3.4.0/
# Replace uuu and ppp with your pool worker name and password.
./cgminer -o stratum+tcp://mint.bitminter.com:3333 -u uuu -p ppp

And assign rights:

chmod +x /home/enira/mine_bitcoins.sh
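
If you want the miner to survive logging out of SSH, one option is to start it in a detached screen session (screen is an extra package, and the session name ‘miner’ is just an example):

sudo apt-get install screen
screen -dmS miner /home/enira/mine_bitcoins.sh

You can reattach to it later with ‘screen -r miner’.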

Now install xrdp so you can get a remote desktop, and go nuts!

sudo apt-get install xrdp
 
echo xfce4-session > ~/.xsession

That’s all there is to it.


Webscraping – Part 2 – Basic XPath HTML queries in JAVA

Introduction

So, this tutorial continues to build on the first part. If you haven’t read it, I recommend doing so, as it contains vital information that I won’t repeat in this post.

Note: please download and open the eclipse project. The run methods can be found in the Scraping02 class.

Additional queries

Sometimes the HTML structure is so fucked up that you need to take matters into your own hands. In run1 I’ll show you how to loop through all elements of the images table one by one.

So ‘node’ is the <td> tag, and ‘node.getFirstChild().getNextSibling()’ is the first <img> tag (the first child is the whitespace text node before it). To loop through, you need ‘next.getNextSibling().getNextSibling()’, since a single getNextSibling() only lands on the whitespace text node between two images.

Example:

private void run1(Document document) {
	System.out.println("run1:");
 
	XPath xpath = XPathFactory.newInstance().newXPath();
	try {
		Node node = (Node) xpath.evaluate("//*[@class='images']//td",
				document, XPathConstants.NODE);
 
		Node next = node.getFirstChild().getNextSibling();
		do {
			String image = next.getAttributes().getNamedItem("src")
					.toString();
 
			// result is: src="image1.jpg", so split
			System.out.println(image.split("\"")[1]);
 
			next = next.getNextSibling().getNextSibling();
		} while (next != null);
 
	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}
 
	System.out.println("");
}

Output:

run1:
image1.jpg
image2.jpg
image3.jpg
image4.jpg
image5.jpg

This hasn’t been covered in part 1, but it is quite useful for web scraping: XPath queries can select attribute values too. This is done using the ‘@’ character, which comes in quite handy when selecting links from <a href=""> tags. In run2 the query selects the links (href attributes) from all <a> tags inside the tag with the class named ‘links’.

private void run2(Document document) {
	System.out.println("run2:");
 
	XPath xpath = XPathFactory.newInstance().newXPath();
	try {
		NodeList nodes = (NodeList) xpath.evaluate(
				"//*[@class='links']//a/@href", document,
				XPathConstants.NODESET);
 
		for (int i = 0; i < nodes.getLength(); i++) {
			System.out.println(nodes.item(i).getTextContent());
		}
	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}
 
	System.out.println("");
}

Output:

run2:
http://enira.net/links/link1.htm
http://enira.net/links/link2.htm
http://enira.net/links/link3.htm
http://enira.net/links/link4.htm
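
As an aside, a quick way to sanity-check such a query outside Java is xmllint (assuming libxml2-utils is installed, run against the test index.html from the project):

xmllint --html --xpath "//*[@class='links']//a/@href" index.html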

In run3 I’ve combined run2 with run1. Here you can also see that HtmlCleaner corrects tags that aren’t well-formed (in this case the <img> tag). The XPath query reads: select all elements with the class ‘images’; from those, select every ‘img’ element with a td parent at any level/depth, and get its src attribute.

private void run3(Document document) {
	System.out.println("run3:");
 
	XPath xpath = XPathFactory.newInstance().newXPath();
	try {
		NodeList nodes = (NodeList) xpath.evaluate(
				"//*[@class='images']//td/img/@src", document,
				XPathConstants.NODESET);
 
		for (int i = 0; i < nodes.getLength(); i++) {
			System.out.println(nodes.item(i).getTextContent());
		}
	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}
 
	System.out.println("");
}

Output:

run3:
image1.jpg
image2.jpg
image3.jpg
image4.jpg
image5.jpg

So that’s about it. Now you should be able to scrape the web quite easily; I think I’ve covered enough for you to get started scraping pages using XPath.

Downloads:

Example project (Eclipse): Scraping02.zip (132 KB)


Webscraping – Part 1 – Basic XPATH HTML queries in JAVA

Introduction

So occasionally I want to scrape a web page for links or images. Being a programmer I hate such tedious tasks, and I always end up writing a script. In the past I always used a combination of substring(), indexOf() or some other string formatting functions. However, web pages are also (more or less) XML documents! This enables a much easier method to search content: XPath.

This little tutorial will handle XPath queries on a plain HTML page in Java. I am using an adapted example that can be found at: http://manual.calibre-ebook.com/xpath.html.
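
By the way, if you want to experiment with XPath expressions before diving into Java, xmllint can evaluate them against an HTML page straight from the shell (assuming libxml2-utils is installed):

xmllint --html --xpath "//h2" index.html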

Setup

The Java library that I will use in this tutorial is HtmlCleaner, an open-source HTML parser written in Java that cleans up ill-formed HTML code. You can download it at: http://htmlcleaner.sourceforge.net/download.php.

 
To add the JAR to your Eclipse project: right click the project and go to ‘Properties’, go to the ‘Libraries’ tab and press ‘Add JARs’.

I also provided a test file. In the project you can find it in the ‘src/resources’ folder. The function readDocument() will read this file and create a usable Document object.

private Document readDocument() {
	String content = null;
	try {
		content = FileUtils
				.readLargeTextFileUTF8("src/resources/index.html");
	} catch (IOException e) {
		e.printStackTrace();
	}
 
	TagNode tagNode = new HtmlCleaner().clean(content);
	Document doc = null;
	try {
		doc = new DomSerializer(new CleanerProperties()).createDOM(tagNode);
		return doc;
	} catch (ParserConfigurationException e) {
		e.printStackTrace();
	} catch (Exception e) {
		e.printStackTrace();
	}
 
	return null;
}

This function will be used by all the run examples.

Selecting by tagname

So this first example will select all h2 tags. Please note: ‘//’ selects matching elements at any depth below the root, not just one level down. HtmlCleaner considers the body tag the root of the document.

private void run1(Document document) {
	System.out.println("run1:");
 
	XPath xpath = XPathFactory.newInstance().newXPath();
	try {
		NodeList nodes = (NodeList) xpath.evaluate("//h2", document,
				XPathConstants.NODESET);
 
		for (int i = 0; i < nodes.getLength(); i++) {
			System.out.println(nodes.item(i).getTextContent());
		}
	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}
 
	System.out.println("");
}

Output:

run1:
Chapter One
Chapter Two

This next example will search all ‘div’ elements and show the p tags inside them. Note: ‘//div/p’ only matches p tags that are direct children of a div element.

private void run2(Document document) {
	System.out.println("run2:");
 
	XPath xpath = XPathFactory.newInstance().newXPath();
	try {
		NodeList nodes = (NodeList) xpath.evaluate("//div/p", document,
				XPathConstants.NODESET);
 
		for (int i = 0; i < nodes.getLength(); i++) {
			System.out.println(nodes.item(i).getTextContent());
		}
	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}
 
	System.out.println("");
}

Output:

run2:
A very short ebook to demonstrate the use of XPath.

This next example gets the same result as the second run. Here the div element itself has to be nested inside another element rather than sitting directly under the root node, which in an HTML document is practically always the case.

private void run3(Document document) {
	System.out.println("run3:");
 
	XPath xpath = XPathFactory.newInstance().newXPath();
	try {
		NodeList nodes = (NodeList) xpath.evaluate("//*/div/p", document,
				XPathConstants.NODESET);
 
		for (int i = 0; i < nodes.getLength(); i++) {
			System.out.println(nodes.item(i).getTextContent());
		}
	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}
 
	System.out.println("");
}

Output:

run3:
A very short ebook to demonstrate the use of XPath.

This next example selects based on the name of the element: in this case the XPath query selects all <h1> and <h2> tags.

private void run4(Document document) {
	System.out.println("run4:");
 
	XPath xpath = XPathFactory.newInstance().newXPath();
	try {
		NodeList nodes = (NodeList) xpath.evaluate(
				"//*[name()='h1' or name()='h2']", document,
				XPathConstants.NODESET);
 
		for (int i = 0; i < nodes.getLength(); i++) {
			System.out.println(nodes.item(i).getTextContent());
		}
	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}
 
	System.out.println("");
}

Output:

run4:
A very short ebook
Chapter One
Chapter Two

 

Selecting by attributes

In XPath it is also possible to select based on attributes. The next example selects the content of all tags that have a ‘style’ attribute.

private void run5(Document document) {
	System.out.println("run5:");
 
	XPath xpath = XPathFactory.newInstance().newXPath();
	try {
		NodeList nodes = (NodeList) xpath.evaluate("//*[@style]", document,
				XPathConstants.NODESET);
 
		for (int i = 0; i < nodes.getLength(); i++) {
			System.out.println(nodes.item(i).getTextContent());
		}
	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}
 
	System.out.println("");
}

Output:

run5:
Written by Kovid Goyal

 

The next example selects all elements of the ‘chapter’ class. This is done by adding an attribute selector like in run5, but specifying that the class attribute must match the value ‘chapter’.

private void run6(Document document) {
	System.out.println("run6:");
 
	XPath xpath = XPathFactory.newInstance().newXPath();
	try {
		NodeList nodes = (NodeList) xpath.evaluate("//*[@class='chapter']",
				document, XPathConstants.NODESET);
 
		for (int i = 0; i < nodes.getLength(); i++) {
			System.out.println(nodes.item(i).getTextContent());
		}
	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}
 
	System.out.println("");
}

Output:

run6:
Chapter One
Chapter Two

Now let’s get a little bit more advanced: the next example selects all <h1> tags which have a class named ‘bookTitle’. The XPath query reads: select all h1 elements with a class attribute that matches the value ‘bookTitle’.

private void run7(Document document) {
	System.out.println("run7:");
 
	XPath xpath = XPathFactory.newInstance().newXPath();
	String str;
	try {
		str = (String) xpath.evaluate("//h1[@class='bookTitle']", document,
				XPathConstants.STRING);
 
		System.out.println(str);
	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}
 
	System.out.println("");
}

Output:

run7:
A very short ebook

 

Selecting by tag content

XPath is also able to select content based on text containing a certain value. run8 shows you how to select all <h2> tags containing the text string ‘One’. Note: XPath is case sensitive!

private void run8(Document document) {
	System.out.println("run8:");
 
	XPath xpath = XPathFactory.newInstance().newXPath();
	try {
		NodeList nodes = (NodeList) xpath.evaluate(
				"//h2[contains(., 'One')]", document,
				XPathConstants.NODESET);
 
		for (int i = 0; i < nodes.getLength(); i++) {
			System.out.println(nodes.item(i).getTextContent());
		}
	} catch (XPathExpressionException e) {
		e.printStackTrace();
	}
 
	System.out.println("");
}

Output:

run8:
Chapter One

Downloads:

Example project (Eclipse): Scraping01.zip (132 KB)
