
How to parse HTML using PHP?

The question: I need a way to parse HTML using PHP. Is there a library that can help me do this?

Yes. A PHP package called PHP Simple HTML DOM can do this. In this tutorial, I will explain, with examples, how to use this library. There may be other options available, but I chose this one because of its simple syntax.

First, download the package from this page.

Now I will show you some examples of how to use this library:

Example 1

This example shows how to read the generator meta tag from the HTML of a page. It first loads the library using require_once, sets the memory limit to unlimited (using ini_set), fetches the HTML contents and creates a new simple_html_dom object. It then searches for meta tags with name=generator in a PHP foreach loop and stops at the first match. If you want all generators, remove the break and put the values in a list or an array.

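require_once("simple_html_dom.php");
ini_set('memory_limit', '-1'); // no memory limit, since large pages need a lot of memory

$domainname = "http://www.ewhathow.com";
$out = file_get_contents($domainname);
if ($out === false) {
	die("Could not fetch " . $domainname);
}

// Lowercase the markup so that name="Generator" also matches.
// Note that this lowercases the extracted value as well.
$htmlcontents = strtolower($out);
$html = new simple_html_dom();
$html->load($htmlcontents);

$generator = ''; // stays empty if the page has no generator meta tag
foreach ($html->find('meta[name=generator]') as $element)
{
	$generator = trim($element->content);
	break; // keep only the first match
}
$html->clear(); // free the memory held by the parsed document
echo $generator;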
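
Example 1 stops at the first match. If a page declares more than one generator, the values could be collected into an array instead. The helper below is only a minimal sketch: collect_generators is a hypothetical name, not part of the library, and it assumes $html is an already loaded simple_html_dom object as in Example 1.

```php
<?php
// Hypothetical helper (not part of PHP Simple HTML DOM): returns the content
// attribute of every <meta name="generator"> element found in $html.
// $html is assumed to be a loaded simple_html_dom object, as in Example 1.
function collect_generators($html)
{
	$generators = array();
	foreach ($html->find('meta[name=generator]') as $element) {
		// Cast to string so a missing content attribute becomes ''.
		$value = trim((string) $element->content);
		if ($value !== '') {
			$generators[] = $value; // skip empty values
		}
	}
	return $generators;
}
```

After $html->load(...), calling collect_generators($html) returns one array entry per non-empty generator tag. You still need to call $html->clear() afterwards.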

Example 2

This example extracts the href of every link (a element) on a page and writes each one to the standard output, one per line.

require_once("simple_html_dom.php");
ini_set('memory_limit', '-1');

$domainname = "http://www.ewhathow.com";
$out = file_get_contents($domainname);
if ($out === false) {
	die("Could not fetch " . $domainname);
}

// The markup is loaded as-is: lowercasing it would corrupt case-sensitive URLs.
$html = new simple_html_dom();
$html->load($out);
foreach ($html->find('a') as $element)
{
	$href = $element->href;
	echo $href . "\n"; // one link per line
}

$html->clear(); // free the memory held by the parsed document
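
The loop above prints every href it finds, including links that point back to the same site. If only outgoing links are wanted, one possible approach is to compare the host of each link with the host of the page using PHP's built-in parse_url. This is a sketch under a few assumptions: is_outgoing_link is a hypothetical helper name, relative links are treated as internal, and hosts are compared literally, so www.ewhathow.com and ewhathow.com would count as different sites.

```php
<?php
// Hypothetical helper: returns true when $href has a host that differs from
// the host of $siteUrl. Links without a host (relative paths, mailto:,
// javascript:) are treated as not outgoing.
function is_outgoing_link($href, $siteUrl)
{
	$linkHost = parse_url($href, PHP_URL_HOST);
	$siteHost = parse_url($siteUrl, PHP_URL_HOST);

	// parse_url returns null when the component is absent and false when
	// the URL is seriously malformed; neither is an outgoing link.
	if (!is_string($linkHost) || !is_string($siteHost)) {
		return false;
	}

	// Host names are case-insensitive.
	return strcasecmp($linkHost, $siteHost) !== 0;
}
```

Inside the foreach loop of Example 2, the echo could then be wrapped in if (is_outgoing_link($href, $domainname)) { ... }.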

Example 3

This example outputs the source URL (the src attribute) of every image on a page:

require_once("simple_html_dom.php");
ini_set('memory_limit', '-1');

$domainname = "http://www.ewhathow.com";
$out = file_get_contents($domainname);
if ($out === false) {
	die("Could not fetch " . $domainname);
}

// The markup is loaded as-is: lowercasing it would corrupt case-sensitive URLs.
$html = new simple_html_dom();
$html->load($out);
foreach ($html->find('img') as $element)
{
	$src = $element->src;
	echo $src . "\n"; // one image URL per line
}

$html->clear(); // free the memory held by the parsed document

Example 4

Modify the contents of an HTML string:

require_once("simple_html_dom.php");

$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
$html->find('div', 1)->class = 'bar';                 // second div (index 1): add class="bar"
$html->find('div[id=hello]', 0)->innertext = 'foo';   // first div with id=hello: replace its text
echo $html;
$html->clear();
/* Output: <div id="hello">foo</div><div id="world" class="bar">World</div> */

Note that after processing HTML with a simple_html_dom object, you must call its clear() method. Otherwise, the parsed documents pile up in memory and usage grows with every page you load. The library's documentation attributes this to a circular-reference memory leak in PHP 5, so it is a known limitation that I have learnt to live with. Also, if the HTML of a page is very large, you may need to raise PHP's memory_limit, either to unlimited (-1) or to a fixed value such as 128M.
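
Since the clear() call is easy to forget, especially when the processing code can throw an exception, one option is to wrap the work in a small function that always cleans up. The following is a sketch, not something the library provides: with_parsed_html is a hypothetical name, it assumes PHP 5.5 or later (for finally), and it only relies on the object having a clear() method.

```php
<?php
// Hypothetical wrapper: runs $work on a parsed document and always calls
// clear() on it afterwards, even if $work throws an exception.
// $html is assumed to be a loaded simple_html_dom object.
function with_parsed_html($html, callable $work)
{
	try {
		return $work($html);
	} finally {
		$html->clear(); // release the memory held by the parsed document
	}
}
```

For instance, with_parsed_html(str_get_html($markup), function ($html) { return count($html->find('a')); }) would count the links in $markup and free the document before returning.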

So, these were four examples of how to parse HTML in a PHP script. I hope this was useful for you!
