How to Parse and Process HTML in PHP

This guide shows how to parse and process HTML in PHP using the DOMDocument class, which can also handle XML.

DOMDocument is provided by PHP's DOM extension. Many PHP builds enable it by default, but on Debian or Ubuntu it ships in the php-xml package, which you may need to install:

sudo apt install php-xml

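Or, if your build ships the extension as a separate library (many Windows builds already have DOM support compiled in), open php.ini and uncomment (remove the leading semicolon from) the following line: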

extension=php_xml.dll

Here’s a step-by-step guide on how to use it:

1. Load HTML:

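To start, create a new instance of DOMDocument and load the HTML content. Real-world HTML is often malformed, so it's a good idea to call libxml_use_internal_errors(true) first: parser warnings are then collected internally instead of being printed, and libxml_clear_errors() empties that buffer afterwards.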

<?php
$html = '<html><body><h1>Welcome</h1><p>This is a sample paragraph.</p></body></html>';

$dom = new DOMDocument();

// Suppress warnings due to invalid HTML
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
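
Clearing the error buffer discards whatever the parser complained about. If you would rather inspect those messages first, a minimal sketch along these lines may help. The helper name parseHtml is hypothetical (it is not part of PHP), and the sample markup is made up for illustration; libxml_get_errors() and the return value of libxml_use_internal_errors() are standard.

```php
<?php
// Hypothetical helper (not part of PHP): parses HTML and returns the
// document together with any messages libxml collected while parsing.
function parseHtml(string $html): array
{
    // Collect errors internally and remember the previous setting
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);

    // Copy the collected messages before clearing the buffer
    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    // Restore whatever error handling was active before the call
    libxml_use_internal_errors($previous);

    return [$loaded ? $dom : null, $messages];
}

// Sample input with an unknown tag and an unclosed paragraph
[$dom, $messages] = parseHtml('<html><body><p>Unclosed paragraph<madeup>text</madeup></body></html>');

foreach ($messages as $message) {
    echo $message . "\n";
}
```

Restoring the previous setting keeps the helper from silently changing error handling for the rest of the script. The exact messages depend on the libxml version installed.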

2. Access Elements:

Once you have loaded the HTML, you can access elements using various methods. For instance, getElementsByTagName returns all elements of a certain type, and getElementById returns the single element with a given id (or null if there is none).

<?php
$html = '<html><body><h1 id="heading">Welcome</h1><p>This is a sample paragraph.</p></body></html>';

$dom = new DOMDocument();

// Suppress warnings due to invalid HTML
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();


// Get all <h1> elements
$h1Tags = $dom->getElementsByTagName('h1');
foreach ($h1Tags as $h1) {
    echo $h1->nodeValue . "\n"; // Output: Welcome
}

// Get all <p> elements
$pTags = $dom->getElementsByTagName('p');
foreach ($pTags as $p) {
    echo $p->nodeValue . "\n"; // Output: This is a sample paragraph.
}

// Select by ID
$heading = $dom->getElementById('heading');
if ($heading) {
    echo $heading->nodeValue . "\n";  // Output: Welcome
}
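
Besides text content, you will often want an element's attributes. The following sketch, using made-up sample markup, collects the link text and href of every anchor with hasAttribute() and getAttribute(); the array keys are chosen only for illustration.

```php
<?php
$html = '<html><body><a href="/home">Home</a><a href="/about" title="About us">About</a><a>No link</a></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    // Skip anchors that have no href attribute
    if (!$a->hasAttribute('href')) {
        continue;
    }
    $links[] = [
        'text'  => $a->textContent,
        'href'  => $a->getAttribute('href'),
        // getAttribute() returns an empty string when the attribute is missing
        'title' => $a->getAttribute('title'),
    ];
}

foreach ($links as $link) {
    echo $link['text'] . ' -> ' . $link['href'] . "\n";
}
```

This prints "Home -> /home" and "About -> /about"; the third anchor is skipped because it has no href.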

3. Select Elements:

If you want to select elements more precisely, for example by ID, class, or attribute, you can use DOMXPath with DOMDocument.

Here is an example:

<?php
// Load the HTML document
$html = '<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Example Document</title>
</head>
<body>
    <div id="container">
        <h1 class="header">Welcome</h1>
        <p class="content">This is a paragraph.</p>
        <p class="content special">This is a special paragraph.</p>
        <a href="#" class="link">Click me</a>
    </div>
</body>
</html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // Collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// Create a new DOMXPath object
$xpath = new DOMXPath($doc);

// 1. Select by ID
$idSelector = $xpath->query('//*[@id="container"]');
foreach ($idSelector as $element) {
    echo "ID Selector: " . $element->nodeName . "\n";
}

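// 2. Select by Class
// An exact test such as @class="content" would miss class="content special",
// so match "content" as one of the space-separated class names instead.
$classSelector = $xpath->query('//*[contains(concat(" ", normalize-space(@class), " "), " content ")]');
foreach ($classSelector as $element) {
    echo "Class Selector: " . $element->nodeName . " - " . $element->textContent . "\n";
}

// 3. Select by Attribute
// Match <p> elements whose class attribute includes the name "special".
$attributeSelector = $xpath->query('//p[contains(concat(" ", normalize-space(@class), " "), " special ")]');
foreach ($attributeSelector as $element) {
    echo "Attribute Selector: " . $element->nodeName . " - " . $element->textContent . "\n";
}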
?>

Output:

ID Selector: div
Class Selector: p - This is a paragraph.
Class Selector: p - This is a special paragraph.
Attribute Selector: p - This is a special paragraph.
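
XPath queries can also be run relative to a node you have already found, and evaluate() can return scalar results such as counts. The sketch below uses its own small sample document, made up for illustration.

```php
<?php
$html = '<html><body>
    <div id="container"><p class="content">Inside</p><p class="content special">Also inside</p></div>
    <p class="content">Outside</p>
</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Use the first match of the ID query as the context node
$container = $xpath->query('//*[@id="container"]')->item(0);

// The leading "." makes the path relative to $container
$inside = $xpath->query('.//p', $container);
echo "Paragraphs inside container: " . $inside->length . "\n";

// evaluate() can return scalar results; count() yields a float
$total = $xpath->evaluate('count(//p)');
echo "Paragraphs in document: " . (int) $total . "\n";
```

Here the relative query finds 2 paragraphs, while the document-wide count is 3.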

4. Modify Elements:

You can also modify elements, such as changing text or attributes. The snippet below continues from step 2 and reuses its $dom and $h1Tags variables.

// Change the content of the first <h1> element
if ($h1Tags->length > 0) {
    $h1Tags->item(0)->nodeValue = 'Hello, World!';
}

// Adding a new <p> element
$newP = $dom->createElement('p', 'This is a new paragraph.');
$dom->getElementsByTagName('body')->item(0)->appendChild($newP);
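
Other common modifications are setting attributes, removing nodes, and adding text that contains special characters. The following self-contained sketch (sample markup made up for illustration) uses setAttribute(), removeChild(), and createTextNode(). Going through createTextNode() is a cautious choice here, since text passed as the second argument of createElement() is not fully escaped for you (a bare & in particular can cause trouble).

```php
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<html><body><h1>Welcome</h1><p class="old">Remove me</p></body></html>');
libxml_clear_errors();

// Add or overwrite an attribute on the first <h1>
$h1 = $dom->getElementsByTagName('h1')->item(0);
$h1->setAttribute('class', 'title');

// Remove the first <p> by asking its parent to drop it
$old = $dom->getElementsByTagName('p')->item(0);
$old->parentNode->removeChild($old);

// Build a new <p> whose text contains characters that need escaping
$p = $dom->createElement('p');
$p->appendChild($dom->createTextNode('Fish & chips < 5'));
$dom->getElementsByTagName('body')->item(0)->appendChild($p);

$result = $dom->saveHTML();
echo $result;
```

In the serialized output the text node appears as "Fish &amp; chips &lt; 5", the h1 carries class="title", and the old paragraph is gone.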

5. Save Changes:

After making modifications, call saveHTML(), which returns the document as a string that you can store in a variable or output directly.

echo $dom->saveHTML(); // Outputs the modified HTML
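
saveHTML() also accepts a node, which is handy when you only want a fragment rather than the whole document with its doctype and html/body wrappers. A minimal sketch, using the sample markup from step 1:

```php
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<html><body><h1>Welcome</h1><p>This is a sample paragraph.</p></body></html>');
libxml_clear_errors();

$body = $dom->getElementsByTagName('body')->item(0);

// Serialize each child of <body> and join the pieces
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $dom->saveHTML($child);
}

echo $fragment . "\n";
```

The resulting string contains the h1 and p markup only. Whitespace between the elements may vary with the PHP and libxml versions in use.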

Example: Full Script

Here’s a complete example that includes loading HTML, accessing, modifying, and saving it:

<?php
$html = '<html><body><h1>Welcome</h1><p>This is a sample paragraph.</p></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Access and modify elements
$h1Tags = $dom->getElementsByTagName('h1');
if ($h1Tags->length > 0) {
    $h1Tags->item(0)->nodeValue = 'Hello, World!';
}

$newP = $dom->createElement('p', 'This is a new paragraph.');
$dom->getElementsByTagName('body')->item(0)->appendChild($newP);

// Save and output the modified HTML
echo $dom->saveHTML();
?>