Extract HTML with PHP’s DOM extension

I recently had to parse HTML with PHP and finally took a look at PHP's DOM extension. Here's a way to extract an element's inner HTML by ID:

 
<?php
/**	 
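 * Extract an element's inner HTML by ID from an HTML document
 * Thanks http://codjng.blogspot.com/2009/10/unicode-problem-when-using-domdocument.html
 *
 * @param string $content HTML source of the page
 * @param string $id      ID of the element to extract
 *
 * @return string HTML content, or an empty string if no element has this ID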
 */

function extract_id( $content, $id ) {
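	// use mb_string if available
	// (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
	if ( function_exists( 'mb_convert_encoding' ) )
		$content = mb_convert_encoding( $content, 'HTML-ENTITIES', 'UTF-8' );
	$dom = new DOMDocument();
	// set before loading: changing it afterwards cannot affect an already parsed document
	$dom->preserveWhiteSpace = false;
	$dom->loadHTML( $content );
	$element = $dom->getElementById( $id );
	// getElementById() returns null when no element has this ID
	if ( null === $element )
		return '';
	return innerHTML( $element );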
}

/**	 
 * Helper, returns the innerHTML of an element
 *
 * @param DOMNode $contentdiv The node whose children are serialized
 *
 * @return string one element's HTML content
 */

function innerHTML( $contentdiv ) {
	$r = '';
	$elements = $contentdiv->childNodes;
	foreach( $elements as $element ) { 
		if ( $element->nodeType == XML_TEXT_NODE ) {
			// nodeValue is already entity-decoded, so re-escape &, < and >
			$r .= htmlspecialchars( $element->nodeValue, ENT_NOQUOTES );
		}	 
		// comments are passed through unchanged
		elseif ( $element->nodeType == XML_COMMENT_NODE ) {
			$r .= '<!--' . $element->nodeValue . '-->';
		}	 
		else {
			$r .= '<';
			$r .= $element->nodeName;
			if ( $element->hasAttributes() ) { 
				$attributes = $element->attributes;
				// escape values so embedded quotes cannot break the markup
				foreach ( $attributes as $attribute )
					$r .= ' ' . $attribute->nodeName . '="' . htmlspecialchars( $attribute->nodeValue ) . '"';
			}	 
			$r .= '>';
			$r .= innerHTML( $element );
			$r .= "</{$element->nodeName}>";
		}	 
	}	 
	return $r;
}
?>

As you can see, the code is not polished, but maybe it'll be useful to you.
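
If you'd rather not build the markup string by hand, here is a minimal sketch of an alternative. It assumes PHP 5.3.6 or later, where `DOMDocument::saveHTML()` accepts a node argument. The function name `inner_html_via_save` and the sample markup are made up for illustration, and this is not what the helper above does.

```php
<?php
/**
 * Hypothetical alternative to a hand-written serializer:
 * let DOMDocument dump each child node itself.
 */
function inner_html_via_save( DOMNode $node ) {
	$html = '';
	foreach ( $node->childNodes as $child ) {
		// saveHTML( $child ) serializes only that node and its descendants
		$html .= $node->ownerDocument->saveHTML( $child );
	}
	return $html;
}

// sample input, ASCII only, so no encoding workaround is needed here
$dom = new DOMDocument();
$dom->loadHTML( '<div id="main"><p>Hi<br><!-- note --></p></div>' );
$out = inner_html_via_save( $dom->getElementById( 'main' ) );
echo $out;
```

Because libxml does the serializing, comments, attribute quoting and tags without a closing counterpart (such as `<br>`) are handled the same way as in a full `saveHTML()` dump.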

8 comments

  1. Not bad, could come in handy in certain situations, but why didn't you use JavaScript (maybe even with jQuery or MooTools) to extract DOM content?

  2. Because I needed to do it on the server. Are you suggesting I use a server-side JavaScript interpreter?

  3. Nope, at the moment I don't see why anyone should use JavaScript as a server-side language of choice. If you need to do something on the server side, PHP is just fine (OK, not as cool as Python, but anyway ;-)). Mind me asking why you had to parse HTML? Just curious...

  4. Great!! Thanks a lot

  5. Ahh thanks! Just what I was looking for, cheers!
  6. Nicolas! Ta for the handy pointers here, but you forgot to take account of self-closing tags! Just putting your three tag-closing lines inside a wee regex conditional seems to sort it out, though. :-)
  7. Awesome, this was just what I needed! Thank you very much! Working like a charm :)
  8. This is very nice code. But what if I had only a space inside a node ( )? How could I preserve that?
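
The self-closing tag fix suggested in comment 6 could look roughly like the sketch below. It swaps the regex conditional for a plain lookup in the list of HTML void elements, and it also escapes text and attribute values. The names `is_void_element` and `inner_html_void_aware` are hypothetical, and comment nodes are left out to keep the example short.

```php
<?php
// Hypothetical helper: HTML void elements never get a closing tag.
function is_void_element( $name ) {
	$void = array( 'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
		'input', 'link', 'meta', 'source', 'track', 'wbr' );
	return in_array( strtolower( $name ), $void, true );
}

// Hypothetical variant of innerHTML() that skips the closing tag
// (and the recursion) for void elements.
function inner_html_void_aware( DOMNode $node ) {
	$r = '';
	foreach ( $node->childNodes as $child ) {
		if ( $child->nodeType == XML_TEXT_NODE ) {
			// text comes back entity-decoded, so escape it again
			$r .= htmlspecialchars( $child->nodeValue, ENT_NOQUOTES );
		} elseif ( $child->nodeType == XML_ELEMENT_NODE ) {
			$r .= '<' . $child->nodeName;
			foreach ( $child->attributes as $attribute ) {
				$r .= ' ' . $attribute->nodeName . '="' . htmlspecialchars( $attribute->nodeValue ) . '"';
			}
			$r .= '>';
			if ( ! is_void_element( $child->nodeName ) ) {
				$r .= inner_html_void_aware( $child ) . "</{$child->nodeName}>";
			}
		}
	}
	return $r;
}

$dom = new DOMDocument();
$dom->loadHTML( '<div id="main"><p>one<br>two <img src="a.png" alt="A"></p></div>' );
$out = inner_html_void_aware( $dom->getElementById( 'main' ) );
echo $out;
```

A fixed list is easier to audit than a regex, at the cost of having to update it if the set of void elements ever changes.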
