
How to Write an HTML Parser in Java

Parsing HTML is a complicated and difficult task. There are many complex tools and libraries out there to do the trick for you. Nonetheless, I will show you how to write a simple HTML parser in 36 lines of code.

The idea behind parsing HTML is simple: extract the content tokens that are held within the HTML code. However, implementing this has become increasingly difficult over the years. Websites that were once coded in only standard HTML have evolved into complex sites that use XHTML, CSS, JavaScript, Java applets, Flash, etc.

So depending on the website, the parser may have to deal with many different cases. This in turn means the code will need to be flexible. For the sake of a simple demonstration, we will not store the parsed content in trees or other data structures, as some parsers do. We will only clean up the content by removing the tags around it. Before anything else, let's discuss our options for finding and removing the tags.

Stacks: Poor Choice

Let's take a quick second to think about how to code this. An HTML tag will typically start with a '<' and end with a '>'. The open and close brackets, along with the tag name and attributes inside, are unimportant to us. So we could read the page one character at a time, and when we come across an open bracket, ignore it. We would then continue to ignore the remaining characters until we get to the close bracket, which we would also ignore.

If you wanted to be creative, you could implement this using stacks. Without getting into details on how to actually implement this, I'll just tell you there are many problems with it. One issue is obvious: what if the content itself contains a bracket?

Another is that it does not clean up the JavaScript and CSS code. So can we alter the algorithm to clean up everything? The better question would be: is there a simpler & more productive way to handle all of these cases? The answer is yes.
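To make both problems concrete, here is a minimal sketch of the character-by-character scan described in this section. The names NaiveTagScanner and stripTags are hypothetical, and a simple boolean flag stands in for the stack, on the assumption that a '<' always opens a tag and the next '>' always closes it.

```java
// Hypothetical sketch of the naive character scan; not part of the final parser.
final class NaiveTagScanner {

    // Drops '<', '>', and every character between them; keeps everything else.
    static String stripTags(String html) {
        StringBuilder text = new StringBuilder();
        boolean insideTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                insideTag = true;      // start ignoring characters
            } else if (c == '>') {
                insideTag = false;     // the close bracket itself is ignored too
            } else if (!insideTag) {
                text.append(c);        // ordinary content
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Simple markup is handled correctly.
        System.out.println(stripTags("<p>Hello, <b>world</b>!</p>"));
        // A bracket in the content is mistaken for the start of a tag,
        // so the rest of the sentence is lost.
        System.out.println(stripTags("<p>3 < 5 is true</p>"));
        // The script tags are removed, but the JavaScript between them is kept.
        System.out.println(stripTags("<script>var x = 1;</script>Text"));
    }
}
```

The second and third calls show the two weaknesses mentioned above: content is silently dropped after a stray bracket, and script code survives as if it were page content.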

Regular Expressions: Better Choice

For our code, we will use regular expressions (regex). Regular expressions allow for flexible identification of strings, characters, and patterns. Regex is extremely powerful and is available in many different programming languages.

Let's look at how we will be implementing regex in our code:

// Compile the regular expression pattern for a typical HTML tag
Pattern tag = Pattern.compile("<.*?>");
               
// Create the matcher object. Invoke the matcher method on the tag pattern.
// Content is our input string.
Matcher mtag = tag.matcher(content);
               
// Replace every match with nothing. replaceAll() already scans the
// whole input, so no find() loop is needed.
content = mtag.replaceAll("");

This compares the input string against the regex pattern and replaces every match with nothing, which is the same as removing it. The pattern <.*?> matches a typical HTML tag: an open bracket, then as few characters as possible, then a close bracket. For more information on creating regex patterns, visit: http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html.
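The question mark in <.*?> matters. As a small illustration (the class and method names here are hypothetical), the sketch below compares the reluctant pattern used in this article with its greedy counterpart <.*>, which matches from the first open bracket to the last close bracket on the line.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of reluctant and greedy tag patterns.
final class LazyVersusGreedy {

    // Reluctant: each match stops at the nearest '>'.
    static String stripLazy(String html) {
        return Pattern.compile("<.*?>").matcher(html).replaceAll("");
    }

    // Greedy: one match runs from the first '<' to the last '>'.
    static String stripGreedy(String html) {
        return Pattern.compile("<.*>").matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<b>bold</b> and <i>italic</i>";
        System.out.println("lazy:   [" + stripLazy(html) + "]");
        System.out.println("greedy: [" + stripGreedy(html) + "]");
    }
}
```

With the greedy form the whole sample line counts as one tag and nothing is left, which is why the reluctant form is the one to use here.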

However, before we remove all the HTML tags, it is important to look for specific tags. Some tags have content between them that needs to be removed. Just as a note, the pattern <.*?>.*?</.*?> will find and remove an opening tag, a closing tag, and the content between them. However, this is not useful in this case, since it would strip the content of every element, not just the ones we want to discard.

Some HTML tags enclose content, while others enclose information that can be regarded as non-content. What I mean by non-content is JavaScript and CSS code. In most cases, this will be useless to us, so let's find and remove it. To locate these entire sections of code, we will use regex to search for: <style.*?>.*?</style> and <script.*?>.*?</script>.
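As a minimal sketch of these two searches, the snippet below applies both patterns to a single-line input. The class and method names are hypothetical, and the Pattern.CASE_INSENSITIVE flag is an addition not used elsewhere in this article; it is included on the assumption that some pages write their tags in upper case.

```java
import java.util.regex.Pattern;

// Hypothetical helper that removes style and script sections from one line of HTML.
final class NonContentRemover {

    // CASE_INSENSITIVE is an assumption added here so that <SCRIPT> also matches.
    private static final Pattern STYLE =
            Pattern.compile("<style.*?>.*?</style>", Pattern.CASE_INSENSITIVE);
    private static final Pattern SCRIPT =
            Pattern.compile("<script.*?>.*?</script>", Pattern.CASE_INSENSITIVE);

    // Removes the tags together with everything between them.
    static String removeNonContent(String html) {
        String withoutStyle = STYLE.matcher(html).replaceAll("");
        return SCRIPT.matcher(withoutStyle).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<style type=\"text/css\">p { color: red; }</style>"
                + "<p>Hi</p><SCRIPT>alert(1);</SCRIPT>";
        System.out.println(removeNonContent(html));
    }
}
```

Only the paragraph markup remains afterwards, ready for the generic tag pattern to finish the job.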

Reading From the Web

The code I am proposing will read in the source code from the website one line at a time using InputStreamReader and BufferedReader, and append each line to one String. At the end of each line, I want to add a new line character. The following code is an example of this, but this code will have to be altered, as you will soon see:

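// address is the java.net.URL of the page, as declared in the final code.
String sourceLine;
String content = "";

InputStreamReader pageInput = new InputStreamReader(address.openStream());
BufferedReader source = new BufferedReader(pageInput);

// Append each line of the page, followed by a new line character.
while ((sourceLine = source.readLine()) != null) {
        content += sourceLine + "\n";
}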
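As a side note, repeated += on a String copies the whole string for every line read. Below is a minimal sketch of the same loop using StringBuilder and try-with-resources. The names SourceReader, readAll, and joinLines are hypothetical, and the separator is a parameter so that either "\n" or another character can be appended after each line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper; the article's own code reads directly inside main().
final class SourceReader {

    // Reads every line from the reader and appends the separator after each one.
    static String readAll(Reader in, String separator) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the BufferedReader and the wrapped reader.
        try (BufferedReader source = new BufferedReader(in)) {
            String sourceLine;
            while ((sourceLine = source.readLine()) != null) {
                content.append(sourceLine).append(separator);
            }
        }
        return content.toString();
    }

    // Convenience for trying readAll without a network connection.
    static String joinLines(String text, String separator) {
        try {
            return readAll(new StringReader(text), separator);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // For a real page, pass new InputStreamReader(address.openStream()) to readAll.
        System.out.print(joinLines("<p>\nHello\n</p>", "\n"));
    }
}
```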

Reading the source in line by line and keeping a new line character at the end of each line presents another issue that needs to be dealt with. HTML code is usually spread over multiple lines, and JavaScript and CSS code almost always is. Here is an example of some simple JavaScript code, in which you will notice the code spans multiple lines:

<script type="text/javascript">
        document.write("This is a ");
        document.write("simple example!");
</script>

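If we keep the new line characters (\n) in our content, our regex code will fail, because by default the dot (.) in a Java pattern does not match line terminators, so a pattern such as <script.*?>.*?</script> can never span more than one line. To fix this, replace the "\n" with a "\t" (a tab character). This marks the place where each new line character was, while keeping the code all on a single line. Now the regex searches will work. Note that any tab characters already in the page will also be turned into new lines. Later, we can translate the "\t" back into "\n" using this code: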

// Remove the tab characters. Replace each run of tabs with one new line character.
Pattern nLineChar = Pattern.compile("\t+");
Matcher mnLine = nLineChar.matcher(content);
content = mnLine.replaceAll("\n");
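An alternative to the tab substitution, not used in this article, is the Pattern.DOTALL flag, which lets the dot match line terminators as well. The sketch below is only an illustration of that option; DotallDemo, PAGE, and removeScripts are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo: the same pattern with and without Pattern.DOTALL.
final class DotallDemo {

    // A script block spread over several lines, followed by visible text.
    static final String PAGE =
            "<script type=\"text/javascript\">\n"
            + "    document.write(\"simple example!\");\n"
            + "</script>\n"
            + "Visible text";

    // flags is passed straight to Pattern.compile; use 0 for the default behavior.
    static String removeScripts(String html, int flags) {
        return Pattern.compile("<script.*?>.*?</script>", flags)
                .matcher(html)
                .replaceAll("");
    }

    public static void main(String[] args) {
        // Default flags: the dot stops at each "\n", so nothing matches.
        System.out.println(removeScripts(PAGE, 0).equals(PAGE));
        // DOTALL: the dot also matches "\n", so the whole block is removed.
        System.out.println(removeScripts(PAGE, Pattern.DOTALL));
    }
}
```

With DOTALL the original line breaks stay in place, so tabs that were already in the page are not converted into new lines at the end.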

Final Code

I added a few other cases to the finished code, which should give you a starting point to build on. Each website is different, so you will need to tailor the code to your needs. Take a look and try it out for yourself.

// Inferno Development

import java.net.*;
import java.io.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class htmlContentParser {
        public static void main (String[] args) throws IOException {
 
                String sourceLine;
                String content = "";

                // The URL address of the page to open.
                URL address = new URL("http://www.example.com/");

                // Open the address and create a BufferedReader with the source code.
                InputStreamReader pageInput = new InputStreamReader(address.openStream());
                BufferedReader source = new BufferedReader(pageInput);

                // Append each new HTML line into one string. Add a tab character.
                while ((sourceLine = source.readLine()) != null)
                        content += sourceLine + "\t";

                // Remove style tags & inclusive content
                Pattern style = Pattern.compile("<style.*?>.*?</style>");
                Matcher mstyle = style.matcher(content);
                content = mstyle.replaceAll("");

                // Remove script tags & inclusive content
                Pattern script = Pattern.compile("<script.*?>.*?</script>");
                Matcher mscript = script.matcher(content);
                content = mscript.replaceAll("");

                // Remove comment tags & inclusive content. This runs before the
                // generic tag pattern, which would otherwise cut a comment short
                // at the first '>' inside it.
                Pattern comment = Pattern.compile("<!--.*?-->");
                Matcher mcomment = comment.matcher(content);
                content = mcomment.replaceAll("");

                // Remove primary HTML tags
                Pattern tag = Pattern.compile("<.*?>");
                Matcher mtag = tag.matcher(content);
                content = mtag.replaceAll("");

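                // Remove special characters, such as &nbsp; or &#160;. Requiring
                // word characters after the '&' keeps a stray ampersand in the
                // text from swallowing everything up to the next semicolon.
                Pattern sChar = Pattern.compile("&#?\\w+;");
                Matcher msChar = sChar.matcher(content);
                content = msChar.replaceAll("");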

                // Remove the tab characters. Replace with new line characters.
                Pattern nLineChar = Pattern.compile("\t+");
                Matcher mnLine = nLineChar.matcher(content);
                content = mnLine.replaceAll("\n");

                // Print the clean content & close the Readers
                System.out.println(content);
                pageInput.close();
                source.close();
        }
}
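The order of the substitutions matters. As a small sketch under stated assumptions (StripOrderDemo and its method names are hypothetical), the code below runs the comment pattern and the generic tag pattern in both orders on a comment that contains a '>' character.

```java
import java.util.regex.Pattern;

// Hypothetical demo of why comments are safer to remove before generic tags.
final class StripOrderDemo {

    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // Generic tags first: the tag pattern cuts the comment at its first '>'.
    static String tagsThenComments(String html) {
        String noTags = TAG.matcher(html).replaceAll("");
        return COMMENT.matcher(noTags).replaceAll("");
    }

    // Comments first: the whole comment is gone before the tag pattern runs.
    static String commentsThenTags(String html) {
        String noComments = COMMENT.matcher(html).replaceAll("");
        return TAG.matcher(noComments).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Keep</p><!-- if a > b then hide -->";
        System.out.println("[" + tagsThenComments(html) + "]");
        System.out.println("[" + commentsThenTags(html) + "]");
    }
}
```

In the first order, part of the comment text leaks into the output; in the second, only the paragraph content remains.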


I want to get href and src

I want to get the href and src values from HTML. How do I get these?
