Jsoup Java Html Parser Example
1- What is Jsoup?
Jsoup is a java html parser. It is a java library that is used to parse HTML document. Jsoup provides api to extract and manipulate data from URL or HTML file. It uses DOM, CSS and Jquery-like methods for extracting and manipulating file.
Let's look at an example with Jsoup:
HelloJsoup.java
import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; public class HelloJsoup { public static void main( String[] args ) throws IOException{ Document doc = Jsoup.connect("http://eclipse.org").get(); String title = doc.title(); System.out.println("Title : " + title); } }
2- Jsoup Library
You can use Maven or download the Jsoup library.
Using maven:
<!-- http://mvnrepository.com/artifact/org.jsoup/jsoup --> <dependency> <groupId>org.jsoup</groupId> <artifactId>jsoup</artifactId> <version>1.8.3</version> </dependency>
Or download:
3- Jsoup API
Jsoup includes many classes, however, its three most important classes are:
- org.jsoup.Jsoup
- org.jsoup.nodes.Document
- org.jsoup.nodes.Element
- Jsoup.java
Method | Description |
---|---|
static Connection connect(String url) | create and returns connection of URL. |
static Document parse(File in, String charsetName) | parses the specified charset file into document. |
static Document parse(File in, String charsetName, String baseUri) | parses the specified charset and baseUri file into Document. |
static Document parse(String html) | parses the given html code into document. |
static Document parse(String html, String baseUri) | parses the given html code with baseUri into Document. |
static Document parse(URL url, int timeoutMillis) | parses the given URL into Document. |
static String clean(String bodyHtml, Whitelist whitelist) | returns safe HTML from input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes. |
- Document.java
Methods | Description |
---|---|
Element body() | Accessor to the document's body element. |
Charset charset() | Returns the charset used in this document. |
void charset(Charset charset) | Sets the charset used in this document. |
Document clone() | Create a stand-alone, deep copy of this node, and all of its children. |
Element createElement(String tagName) | Create a new Element, with this document's base uri. |
static Document createShell(String baseUri) | Create a valid, empty shell of a document, suitable for adding more elements to. |
Element head() | Accessor to the document's head element. |
String location() | Get the URL this Document was parsed from. |
String nodeName() | Get the node name of this node. |
Document normalise() | Normalise the document. |
String outerHtml() | Get the outer HTML of this node. |
Document.OutputSettings outputSettings() | Get the document's current output settings. |
Document outputSettings(Document.OutputSettings outputSettings) | Set the document's output settings. |
Document.QuirksMode quirksMode() | |
Document quirksMode(Document.QuirksMode quirksMode) | |
Element text(String text) | Set the text of the body of this document. |
String title() | Get the string contents of the document's title element. |
void title(String title) | Set the document's title element. |
boolean updateMetaCharsetElement() | Returns whether the element with charset information in this document is updated on changes through Document.charset(Charset) or not. |
void updateMetaCharsetElement(boolean update) | Sets whether the element with charset information in this document is updated on changes through Document.charset(Charset) or not. |
- Element.java
4- Manipulating Document
4.1- Create Documet from URL
GetDocumentFromURL.java
package org.o7planning.tutorial.jsoup.document; import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; public class GetDocumentFromURL { public static void main(String[] args) throws IOException { Document doc = Jsoup.connect("http://eclipse.org").get(); String title = doc.title(); System.out.println("Title : " + title); } }
Running example:
4.2- Create Document from File
GetDocumentFromFile.java
package org.o7planning.tutorial.jsoup.document; import java.io.File; import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; public class GetDocumentFromFile { public static void main(String[] args) throws IOException { File htmlFile = new File("C:/index.html"); Document doc = Jsoup.parse(htmlFile, "UTF-8"); String title = doc.title(); System.out.println("Title : " + title); } }
4.3- Create Document from String
GetDocumentFromString.java
package org.o7planning.tutorial.jsoup.document; import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; public class GetDocumentFromString { public static void main(String[] args) throws IOException { String htmlString = "<html><head><title>Simple Page</title></head>" + "<body>Hello</body></html>"; Document doc = Jsoup.parse(htmlString); String title = doc.title(); System.out.println("Title : " + title); System.out.println("Content:\n"); System.out.println(doc.toString()); } }
Running example:
4.4- Parsing HTML Fragment
A full HTML document includes Header and Body, sometimes you also need to parse an HTML fragment. And you can get a full HTML document includes headers and body. See for example:
ParsingBodyFragment.java
package org.o7planning.tutorial.jsoup.document; import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; public class ParsingBodyFragment { public static void main(String[] args) throws IOException { String htmlFragment = "<h1>Hi you!</h1><p>What is this?</p>"; Document doc = Jsoup.parseBodyFragment(htmlFragment); String fullHtml = doc.html(); System.out.println(fullHtml); } }
Running example:
5- DOM Methods
Jsoup has some methods similar to the method in the DOM model ( Parsing XML document)
Methods | Description |
Element getElementById(String id) | Find an element by ID, including or under this element. |
Elements getElementsByTag(String tag) | Finds elements, including and recursively under this element, with the specified tag name. |
Elements getElementsByClass(String className) | Find elements that have this class, including or under this element. |
Elements getElementsByAttribute(String key) | Find elements that have a named attribute set. Case insensitive. |
Elements siblingElements() | Get sibling elements. |
Element firstElementSibling() | Gets the first element sibling of this element. |
Element lastElementSibling() | Gets the last element sibling of this element. |
...... |
The method of retrieving data of Element.
Method | Description |
String attr(String key) | Get an attribute's value by its key. |
void attr(String key, String value) | Set an attribute. If the attribute already exists, it is replaced. |
String id() | Return The id attribute, if present, or an empty string if not. |
String className() | Gets the literal value of this element's "class" attribute, which may include multiple class names, space separated. (E.g. on <div class="header gray"> returns, " header gray") |
Set<String> classNames() | Get all of the element's class names. E.g. on element <div class="header gray">, returns a set of two elements "header", "gray". Note that modifications to this set are not pushed to the backing class attribute; use the classNames(java.util.Set) method to persist them. |
String text() | Gets the combined text of this element and all its children. |
void text(String value) | Set the text of this element. |
String html() | Retrieves the element's inner HTML. E.g. on a <div><p>a</p></div> , would return <p>a</p> . (Whereas Node.outerHtml() would return <div><p>a</p></div> .) |
void html(String value) | Set this element's inner HTML. Clears the existing HTML first. |
Tag tag() | Get the Tag for this element |
String tagName() | Get the name of the tag for this element. E.g. div |
...... |
The methods to manipulate HTML:
Methods | Description |
Element append(String html) | Add inner HTML to this element. The supplied HTML will be parsed, and each node appended to the end of the children. |
Element prepend(String html) | Add inner HTML into this element. The supplied HTML will be parsed, and each node prepended to the start of the element's children. |
Element appendText(String text) | Create and append a new TextNode to this element. |
Element prependText(String text) | Create and prepend a new TextNode to this element. |
Element appendElement(String tagName) | Create a new element by tag name, and add it as the last child. |
Element prependElement(String tagName) | Create a new element by tag name, and add it as the first child. |
Element html(String value) | Set this element's inner HTML. Clears the existing HTML first. |
...... |
For example, using the DOM methods, parsing an HTML document and retrieve information of form tag.
register.html
<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <title>Register</title> </head> <body> <form id="registerForm" action="doRegister" method="post"> <table> <tr> <td>User Name</td> <td><input type="text" name="userName" value="Tom" /></td> </tr> <tr> <td>Password</td> <td><input type="password" name="password" value="Tom001" /></td> </tr> <tr> <td>Email</td> <td><input type="email" name="email" value="theEmail@gmail.com" /></td> </tr> <tr> <td colspan="2"><input type="submit" name="submit" value="Submit" /></td> </tr> </table> </form> </body> </html>
ReadHtmlForm.java
package org.o7planning.tutorial.jsoup.dom; import java.io.File; import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element; import org.jsoup.select.Elements; public class ReadHtmlForm { public static void main(String[] args) throws IOException { Document doc = Jsoup.parse(new File("files/register.html"), "utf-8"); Element form = doc.getElementById("registerForm"); System.out.println("Form action = "+ form.attr("action")); Elements inputElements = form.getElementsByTag("input"); for (Element inputElement : inputElements) { String key = inputElement.attr("name"); String value = inputElement.attr("value"); System.out.println(key + " = " + value); } } }
Running example:
GetAllLinks.java
package org.o7planning.tutorial.jsoup.dom; import java.io.IOException; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element; import org.jsoup.select.Elements; public class GetAllLinks { public static void main(String[] args) throws IOException { Document doc = Jsoup.connect("http://o7planning.org").get(); // Elements extends ArrayList<Element>. Elements aElements = doc.getElementsByTag("a"); for (Element aElement : aElements) { String href = aElement.attr("href"); String text = aElement.text(); System.out.println(text); System.out.println("\t" + href); } } }
Running example:
6- The methods similar to jQuery,Css
You want to find or manipulate elements using a CSS or jquery-like selector syntax?
JSoup give you a few methods to do this:
- Element.select(String selector)
- Elements.select(String selector)
Example:
Connection conn = Jsoup.connect("http://o7planning.org"); Document doc = conn.get(); // a with href Elements links = doc.select("a[href]"); // img with src ending .png Elements pngs = doc.select("img[src$=.png]"); // div with class=masthead Element masthead = doc.select("div.masthead").first(); // direct a after h3 Elements resultLinks = doc.select("h3.r > a");
Jsoup elements support a CSS (or jquery) like selector syntax to find matching elements, that allows very powerful and robust queries.
The select method is available in a Document, Element, or in Elements. It is contextual, so you can filter by selecting from a specific element, or by chaining select calls.
Select returns a list of Elements (as Elements), which provides a range of methods to extract and manipulate the results.
Selector overview
Selector | Description |
tagname | find elements by tag, e.g. a |
ns|tag | find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements |
#id | find elements by ID, e.g. #logo |
.class: | find elements by class name, e.g. .masthead |
[attribute] | elements with attribute, e.g. [href] |
[^attr] | elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes |
[attr=value] | elements with attribute value, e.g. [width=500] (also quotable, like sequence") |
[attr^=value], [attr$=value], [attr*=value] | elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/] |
[attr~=regex] | elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)] |
* | all elements, e.g. * |
Selector combinations
Selector | Description |
el#id | elements with ID, e.g. div#logo |
el.class | elements with class, e.g. div.masthead |
el[attr] | elements with attribute, e.g. a[href] |
Any combination, e.g. a[href].highlight | |
ancestor child | child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body" |
parent > child | child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag |
siblingA + siblingB | finds sibling B element immediately preceded by sibling A, e.g. div.head + div |
siblingA ~ siblingX | finds sibling X element preceded by sibling A, e.g. h1 ~ p |
el, el, el | group multiple selectors, find elements that match any of the selectors; e.g. div.masthead, div.logo |
Pseudo selectors
Selector | Description |
:lt(n) | find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3) |
:gt(n) | find elements whose sibling index is greater than n; e.g. div p:gt(2) |
:eq(n) | find elements whose sibling index is equal to n; e.g. form input:eq(1) |
:has(seletor) | find elements that contain elements matching the selector; e.g. div:has(p) |
:not(selector) | find elements that do not match the selector; e.g. div:not(.logo) |
:contains(text) | find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup) |
:containsOwn(text) | find elements that directly contain the given text |
:matches(regex) | find elements whose text matches the specified regular expression; e.g. div:matches((?i)login) |
:matchesOwn(regex) | find elements whose own text matches the specified regular expression |
Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, et |
QueryLinks.java
package org.o7planning.tutorial.jsoup.selector; import java.io.IOException; import java.util.Iterator; import org.jsoup.Connection; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element; import org.jsoup.select.Elements; public class QueryLinks { public static void main(String[] args) throws IOException { Connection conn = Jsoup.connect("http://o7planning.org"); Document doc = conn.get(); // Query <a> elements, href contain /document/ String cssQuery = "a[href*=/document/]"; Elements elements= doc.select(cssQuery); Iterator<Element> iterator = elements.iterator(); while(iterator.hasNext()) { Element e = iterator.next(); System.out.println(e.attr("href")); } } }
Results:
document.html
<html> <head> <title>Jsoup Example</title> </head> <body> <h1>Java Tutorial For Beginners</h1> <br> <div id="content"> Content .... </div> <div class="related-container"> <h3>Related Documents</h3> <a href="http://o7planning.org/web/fe/default/en/document/649342/guide-to-installing-and-configuring-eclipse"> Guide to Installing and Configuring Eclipse </a> <a href="http://o7planning.org/web/fe/default/en/document/649326/guide-to-installing-and-configuring-java"> Guide to Installing and Configuring Java </a> <a href="http://o7planning.org/web/fe/default/en/document/245310/jdk-javadoc-in-chm-format"> Jdk Javadoc in chm format </a> </div> </body> </html>
SelectorDemo1.java
package org.o7planning.tutorial.jsoup.selector; import java.io.File; import java.io.IOException; import java.util.Iterator; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element; import org.jsoup.select.Elements; public class SelectorDemo1 { public static void main(String[] args) throws IOException { File htmlFile = new File("document.html"); Document doc = Jsoup.parse(htmlFile, "UTF-8"); // First <div> element has class ="related-container" Element div = doc.select("div.related-container").first(); // List the <h3>, the direct child elements of the current element. Elements h3Elements = div.select("> h3"); // Get first <h3> element Element h3 = h3Elements.first(); System.out.println(h3.text()); // List <a> elements, is a descendant of the current element Elements aElements = div.select("a"); // Query the current element list. // The element that href contains 'installing'. Elements aEclipses = aElements.select("[href*=Installing]"); Iterator<Element> iterator = aEclipses.iterator(); while (iterator.hasNext()) { Element a = iterator.next(); System.out.println("Document: "+ a.text()); } } }
Results:
Source: https://o7planning.org/10399/jsoup-java-html-parser
0 Response to "Jsoup Java Html Parser Example"
Post a Comment