A Detailed Guide to Working with HtmlAgilityPack

PLA12147111

Last modified 2024-12-26 05:05:23

Tags: C#

First published 2024-10-28 15:36:05

Article link: https://blog.csdn.net/PLA12147111/article/details/143301014

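HtmlAgilityPack:

Lightweight and efficient, well suited to routine HTML parsing.
Because of its lightweight design, it tends to be fast when you only need to extract or modify element content.
It also handles deeply nested or large HTML documents fairly smoothly.
The library is small and focused on one job: parsing HTML and querying it with XPath.
It has no built-in support for CSS selectors; that requires an extension library such as Fizzler.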

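1. Install HtmlAgilityPack

Install HtmlAgilityPack through the NuGet package manager, for example by running Install-Package HtmlAgilityPack in the Package Manager Console or dotnet add package HtmlAgilityPack on the command line.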

2. Sample HTML

Suppose we have the following HTML content to parse and manipulate:

```html
<!DOCTYPE html>
<html>
<head>
    <title>HtmlAgilityPack Example</title>
    <style>
        .highlight { color: yellow; }
        #main { background-color: #f0f0f0; }
    </style>
</head>
<body>
    <h1 id='main-heading' class='highlight'>Welcome to HtmlAgilityPack</h1>
    <p>This is a <span class='highlight'>simple</span> example.</p>
    <a href='https://example.com' target='_blank'>Visit Example.com</a>
    <ul id='items'>
        <li class='item'>Item 1</li>
        <li class='item'>Item 2</li>
        <li class='item'>Item 3</li>
    </ul>
    <input type='text' id='username' value='JohnDoe' />
    <input type='password' id='password' />
</body>
</html>
```

3. Parsing and manipulating HTML with HtmlAgilityPack

The following C# example walks through the common operations with HtmlAgilityPack:

```csharp
using HtmlAgilityPack;
using System;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // Sample HTML content
        string html = @"
        <!DOCTYPE html>
        <html>
        <head>
            <title>HtmlAgilityPack Example</title>
            <style>
                .highlight { color: yellow; }
                #main { background-color: #f0f0f0; }
            </style>
        </head>
        <body>
            <h1 id='main-heading' class='highlight'>Welcome to HtmlAgilityPack</h1>
            <p>This is a <span class='highlight'>simple</span> example.</p>
            <a href='https://example.com' target='_blank'>Visit Example.com</a>
            <ul id='items'>
                <li class='item'>Item 1</li>
                <li class='item'>Item 2</li>
                <li class='item'>Item 3</li>
            </ul>
            <input type='text' id='username' value='JohnDoe' />
            <input type='password' id='password' />
        </body>
        </html>";

        // 1. Load the HTML document
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        // 2. Select elements
        // Use XPath to select every element whose class attribute is exactly 'highlight'.
        // Note: SelectNodes returns null (not an empty collection) when nothing matches.
        var highlights = document.DocumentNode.SelectNodes("//*[@class='highlight']");
        Console.WriteLine("Elements with class 'highlight':");
        foreach (var elem in highlights)
        {
            Console.WriteLine($"- <{elem.Name}>: {elem.InnerText}");
        }

        // Select one specific element by its ID
        var mainHeading = document.GetElementbyId("main-heading");
        if (mainHeading != null)
        {
            Console.WriteLine($"\nElement with ID 'main-heading': {mainHeading.InnerText}");
        }

        // Select all <a> tags
        var links = document.DocumentNode.SelectNodes("//a");
        Console.WriteLine("\nAll <a> elements:");
        foreach (var link in links)
        {
            Console.WriteLine($"- Text: {link.InnerText}, Href: {link.GetAttributeValue("href", "")}, Target: {link.GetAttributeValue("target", "")}");
        }

        // Select all <li> elements with class 'item'
        var items = document.DocumentNode.SelectNodes("//li[@class='item']");
        Console.WriteLine("\nList items with class 'item':");
        foreach (var item in items)
        {
            Console.WriteLine($"- {item.InnerText}");
        }

        // Select input elements of a specific type
        var textInput = document.DocumentNode.SelectSingleNode("//input[@type='text']");
        var passwordInput = document.DocumentNode.SelectSingleNode("//input[@type='password']");
        Console.WriteLine($"\nText Input Value: {textInput.GetAttributeValue("value", "")}");
        Console.WriteLine($"Password Input Value: {passwordInput.GetAttributeValue("value", "")}");

        // 3. Read and modify attributes
        // Read the href attribute of the first link
        string firstLinkHref = links.First().GetAttributeValue("href", "");
        Console.WriteLine($"\nFirst link href: {firstLinkHref}");

        // Change the href attribute of the first link
        links.First().SetAttributeValue("href", "https://newexample.com");
        Console.WriteLine($"Modified first link href: {links.First().GetAttributeValue("href", "")}");

        // 4. Read and modify text content
        // Read the text of the first paragraph
        var firstParagraph = document.DocumentNode.SelectSingleNode("//p");
        Console.WriteLine($"\nFirst paragraph text: {firstParagraph.InnerText}");

        // Replace the content of the first paragraph (this removes the original <span>)
        firstParagraph.InnerHtml = "This is an <strong>updated</strong> example.";
        Console.WriteLine($"Modified first paragraph HTML: {firstParagraph.InnerHtml}");

        // 5. Work with CSS classes
        // Read the element's class attribute
        string h1Classes = mainHeading.GetAttributeValue("class", "");
        Console.WriteLine($"\nMain heading classes: {h1Classes}");

        // Add a new class
        mainHeading.SetAttributeValue("class", h1Classes + " new-class");
        Console.WriteLine($"Main heading classes after adding 'new-class': {mainHeading.GetAttributeValue("class", "")}");

        // Remove a class by hand: split the attribute into names and drop the unwanted one.
        // (A plain string Replace would also damage names such as 'highlighted'.)
        h1Classes = string.Join(" ", mainHeading.GetAttributeValue("class", "")
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(c => c != "highlight"));
        mainHeading.SetAttributeValue("class", h1Classes);
        Console.WriteLine($"Main heading classes after removing 'highlight': {mainHeading.GetAttributeValue("class", "")}");

        // 6. Traverse and query the DOM
        // Print the tag name of every direct child element of <body>
        Console.WriteLine("\nChild elements of <body>:");
        var bodyChildren = document.DocumentNode.SelectSingleNode("//body").ChildNodes;
        foreach (var child in bodyChildren)
        {
            if (child.NodeType == HtmlNodeType.Element)
            {
                Console.WriteLine($"- <{child.Name}>");
            }
        }

        // Find elements containing specific text.
        // The <span> holding 'simple' was replaced in step 4, so this query now
        // matches nothing and SelectNodes returns null; guard before iterating.
        var elementsWithText = document.DocumentNode.SelectNodes("//*[contains(text(), 'simple')]");
        Console.WriteLine("\nElements containing the text 'simple':");
        if (elementsWithText == null)
        {
            Console.WriteLine("- (none)");
        }
        else
        {
            foreach (var elem in elementsWithText)
            {
                Console.WriteLine($"- <{elem.Name}>: {elem.InnerText}");
            }
        }

        // 7. Produce and print the modified HTML
        string modifiedHtml = document.DocumentNode.OuterHtml;
        Console.WriteLine("\nModified HTML:");
        Console.WriteLine(modifiedHtml);
    }
}
```
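
Because SelectNodes returns null rather than an empty collection when nothing matches, a foreach over its result can throw. One way to guard against that is sketched below. SelectNodesOrEmpty and the SafeSelect class are hypothetical names introduced here for illustration; they are not part of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class SafeSelect
{
    // Hypothetical helper (not part of HtmlAgilityPack):
    // converts a null result from SelectNodes into an empty sequence.
    public static IEnumerable<HtmlNode> SelectNodesOrEmpty(this HtmlNode node, string xpath)
    {
        return (IEnumerable<HtmlNode>)node.SelectNodes(xpath) ?? Enumerable.Empty<HtmlNode>();
    }
}

class Program
{
    static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml("<ul><li class='item'>Item 1</li></ul>");

        // There is no <table>, so SelectNodes alone would return null here.
        // With the helper the loop body simply never runs.
        foreach (var row in document.DocumentNode.SelectNodesOrEmpty("//table//tr"))
        {
            Console.WriteLine(row.InnerText);
        }

        int itemCount = document.DocumentNode.SelectNodesOrEmpty("//li[@class='item']").Count();
        Console.WriteLine($"Items found: {itemCount}"); // Items found: 1
    }
}
```

The same effect can be had inline with an explicit null check; the extension method only saves repeating it at every call site.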

4. Code walkthrough

1. Load the HTML document

```csharp
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
```

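2. Select elements

Select the collection of all elements that share a feature with XPath: .SelectNodes("XPath"). The result is null when nothing matches.
```csharp
var elements = document.DocumentNode.SelectNodes("//*[@class='class']");
```

Select a single, unique element with XPath: .SelectSingleNode("XPath");
```csharp
var div = document.DocumentNode.SelectSingleNode("//div[@id='title-content']");
```

Select a specific element by its ID: .GetElementbyId("id");
```csharp
var element = document.GetElementbyId("id");
```
Get the child nodes: .ChildNodes. This is the collection of direct (first-level) children only; deeper descendants are not included.
```csharp
var bodyChildren = document.DocumentNode.SelectSingleNode("//body").ChildNodes;
```
Get the first node. On a collection such as the result of SelectNodes, LINQ's .First() (requires using System.Linq) returns the first node; for the first child of a single element, use its .FirstChild property.
```csharp
var firstNode = elements.First();
var firstChildNode = element.FirstChild;
```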

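3. Extract content and attributes

Suppose we are working with the following element:

```csharp
var element = document.GetElementbyId("id");
```

Extract the element's inner HTML
```csharp
string innerHtml = element.InnerHtml;
```
Extract the HTML including the element itself
```csharp
string outerHtml = element.OuterHtml;
```
Extract the text
```csharp
string text = element.InnerText;
```

Extract an attribute (the second argument is the default returned when the attribute is missing)

```csharp
string _value = element.GetAttributeValue("value", "");
```

Extract the href

```csharp
string href = element.GetAttributeValue("href", "");
```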

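4. Modify attributes

Change the href

```csharp
element.SetAttributeValue("href", "https://newexample.com");
```

Add a class

```csharp
string oldClasses = element.GetAttributeValue("class", "");
element.SetAttributeValue("class", oldClasses + " new-class");
```

Remove a class

```csharp
// Remove a class by hand: split the attribute into names and drop the unwanted one.
// Requires using System and using System.Linq.
string newClasses = string.Join(" ", element.GetAttributeValue("class", "")
    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
    .Where(c => c != "highlight"));
element.SetAttributeValue("class", newClasses);
```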

5. Common XPath patterns for getting elements

By id

```csharp
var element = document.DocumentNode.SelectSingleNode("//*[@id='id']");
```

By class (matches only when the class attribute is exactly this string)

```csharp
var elements = document.DocumentNode.SelectNodes("//*[@class='class']");
```

By matching text

```csharp
var elementsWithText = document.DocumentNode.SelectNodes("//*[contains(text(), 'simple')]");
```

By class combined with matching text

```csharp
var elements = document.DocumentNode.SelectNodes("//span[@class='title-content-title' and contains(text(), 'text to match')]");
```
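
The @class='...' comparison is an exact string match, so it misses elements that carry several classes. A common XPath 1.0 idiom for matching one class name among many is sketched below; the sample markup is made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml(
            "<ul>" +
            "<li class='item active'>Item 1</li>" +
            "<li class='item'>Item 2</li>" +
            "<li class='items'>Other</li>" +
            "</ul>");

        // Exact comparison: only class='item' matches, 'item active' does not.
        var exact = document.DocumentNode.SelectNodes("//li[@class='item']");

        // Token match: pad the attribute with spaces and look for ' item ',
        // so 'item active' matches but 'items' does not.
        var token = document.DocumentNode.SelectNodes(
            "//li[contains(concat(' ', normalize-space(@class), ' '), ' item ')]");

        Console.WriteLine(exact.Count); // 1
        Console.WriteLine(token.Count); // 2
    }
}
```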

5. XPath special symbols explained (meaning and usage of /, //, ./, .//, @, *, [predicate], last(), and others)

Suppose we have the following XML document:

```xml
<root>
    <bookstore>
        <book id="1">
            <title>Book 1</title>
        </book>
        <book id="2">
            <title>Book 2</title>
        </book>
    </bookstore>
</root>
```

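1. The difference between / and //

/: at the start of an expression it denotes an absolute path from the root; between steps it moves down exactly one level, to direct children.
//: selects matching nodes at any depth. At the start of an expression it searches the whole document, not just the children of the root.

Use / to select the title nodes starting from the root:

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        // The id attributes are used by the later examples in this section.
        string xml = @"
        <root>
            <bookstore>
                <book id='1'>
                    <title>Book 1</title>
                </book>
                <book id='2'>
                    <title>Book 2</title>
                </book>
            </bookstore>
        </root>";

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        // Select title starting from the root
        XmlNodeList nodes = doc.SelectNodes("/root/bookstore/book/title");
        foreach (XmlNode node in nodes)
        {
            Console.WriteLine(node.InnerText); // Output: Book 1, Book 2
        }
    }
}
```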

Use // to select all title nodes:

```csharp
XmlNodeList nodes = doc.SelectNodes("//title");
foreach (XmlNode node in nodes)
{
    Console.WriteLine(node.InnerText); // Output: Book 1, Book 2
}
```
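
Both queries above return the same nodes, which hides the difference. The minimal sketch below, using the same sample document in compact form, shows a case where / and // disagree.

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<root><bookstore>" +
            "<book id='1'><title>Book 1</title></book>" +
            "<book id='2'><title>Book 2</title></book>" +
            "</bookstore></root>");

        // '/' moves one level at a time: title is not a direct child of root.
        Console.WriteLine(doc.SelectNodes("/root/title").Count);                // 0

        // Spelling out every level finds both titles.
        Console.WriteLine(doc.SelectNodes("/root/bookstore/book/title").Count); // 2

        // '//' skips any number of intermediate levels.
        Console.WriteLine(doc.SelectNodes("/root//title").Count);               // 2
        Console.WriteLine(doc.SelectNodes("//title").Count);                    // 2
    }
}
```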

2. The difference between ./ and .//

.: the current node.
./: searches among the direct children of the current node.
.//: searches among all descendants of the current node, at any depth.

Select child or descendant nodes from a specific node:

```csharp
// Get the bookstore node
XmlNode bookstoreNode = doc.SelectSingleNode("/root/bookstore");

// Use ./ to walk through direct children
XmlNodeList directTitles = bookstoreNode.SelectNodes("./book/title");
foreach (XmlNode title in directTitles)
{
    Console.WriteLine(title.InnerText); // Output: Book 1, Book 2
}

// Use .// to search all descendants
XmlNodeList allTitles = bookstoreNode.SelectNodes(".//title");
foreach (XmlNode title in allTitles)
{
    Console.WriteLine(title.InnerText); // Output: Book 1, Book 2
}
```
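
The leading dot matters most when a query is run on a node other than the document. As a small sketch on the same sample data: an expression starting with // still searches the whole document, even when it is called on a single book node.

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<root><bookstore>" +
            "<book id='1'><title>Book 1</title></book>" +
            "<book id='2'><title>Book 2</title></book>" +
            "</bookstore></root>");

        XmlNode firstBook = doc.SelectSingleNode("/root/bookstore/book[1]");

        // '//' is absolute: it ignores firstBook and scans the entire document.
        Console.WriteLine(firstBook.SelectNodes("//title").Count);  // 2

        // './/' is scoped to the descendants of firstBook.
        Console.WriteLine(firstBook.SelectNodes(".//title").Count); // 1

        // './title' and the bare 'title' both mean "direct children named title".
        Console.WriteLine(firstBook.SelectNodes("./title").Count);  // 1
        Console.WriteLine(firstBook.SelectNodes("title").Count);    // 1
    }
}
```

The same scoping rule applies to XPath queries made on an HtmlNode in HtmlAgilityPack, so .// is usually what you want when querying inside a previously selected element.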

3. @ (attribute selector: selects a node's attributes)

Select the id attribute of the book nodes:

```csharp
XmlNodeList ids = doc.SelectNodes("//book/@id");
foreach (XmlNode id in ids)
{
    Console.WriteLine(id.Value); // Output: 1, 2
}
```
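
The @ symbol can also be used inside a predicate to test whether an attribute exists at all. The sketch below adds a third book without an id (an assumption made for this example, not part of the sample document above) to show the difference.

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<root><bookstore>" +
            "<book id='1'><title>Book 1</title></book>" +
            "<book id='2'><title>Book 2</title></book>" +
            "<book><title>Book 3</title></book>" +
            "</bookstore></root>");

        // [@id] keeps only the books that have an id attribute.
        Console.WriteLine(doc.SelectNodes("//book[@id]").Count);                  // 2

        // not(@id) keeps the books that lack one.
        Console.WriteLine(doc.SelectSingleNode("//book[not(@id)]/title").InnerText); // Book 3

        // Reading the attribute from the element: GetAttribute returns ""
        // when the attribute is missing.
        foreach (XmlElement book in doc.SelectNodes("//book"))
        {
            Console.WriteLine($"id='{book.GetAttribute("id")}'");
        }
    }
}
```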

4. * (wildcard: matches any element node)

Select all direct children of bookstore:

```csharp
XmlNodeList nodes = doc.SelectNodes("/root/bookstore/*");
foreach (XmlNode node in nodes)
{
    Console.WriteLine(node.Name); // Output: book, book
}
```
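
The wildcard combines with the other symbols. A brief sketch on the same sample data, counting what a few wildcard expressions match:

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<root><bookstore>" +
            "<book id='1'><title>Book 1</title></book>" +
            "<book id='2'><title>Book 2</title></book>" +
            "</bookstore></root>");

        // Every element in the document: root, bookstore, 2 x book, 2 x title.
        Console.WriteLine(doc.SelectNodes("//*").Count);      // 6

        // Every direct child element of any book.
        Console.WriteLine(doc.SelectNodes("//book/*").Count); // 2

        // '@*' is the attribute wildcard: every attribute on any node.
        Console.WriteLine(doc.SelectNodes("//@*").Count);     // 2
    }
}
```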

5. [predicate] (filters nodes by a condition)

Select the book node with id=1: [@id='1']

```csharp
XmlNode node = doc.SelectSingleNode("//book[@id='1']");
Console.WriteLine(node.InnerXml); // Output: <title>Book 1</title>
```

6. Select a node's text content: text()

Select the text content of title:

```csharp
XmlNodeList titles = doc.SelectNodes("//title/text()");
foreach (XmlNode title in titles)
{
    Console.WriteLine(title.Value); // Output: Book 1, Book 2
}
```
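
When text() is passed to contains(), as in the contains(text(), '...') pattern, XPath 1.0 only looks at the first text child of each element. The sketch below uses a small made-up fragment with mixed content to show where that matters and two alternatives.

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<div><p>This is a <span>simple</span> example.</p></div>");

        // contains(text(), ...) inspects only the FIRST text child of each element.
        // For <p> that is "This is a ", so 'example' is not found anywhere.
        Console.WriteLine(doc.SelectNodes("//*[contains(text(), 'example')]").Count);    // 0

        // A predicate on text() tests every text child, so <p> now matches.
        Console.WriteLine(doc.SelectNodes("//*[text()[contains(., 'example')]]").Count); // 1

        // '.' is the full string value including descendants,
        // so <div>, <p> and <span> all contain 'simple'.
        Console.WriteLine(doc.SelectNodes("//*[contains(., 'simple')]").Count);          // 3
    }
}
```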

7. Select the last node in a set: last()

Select the last book node:

```csharp
XmlNode node = doc.SelectSingleNode("//book[last()]");
Console.WriteLine(node.InnerXml); // Output: <title>Book 2</title>
```
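
Note that //book[last()] means "the last book under each parent", which only coincides with "the last book in the document" when all books share one parent. A sketch with two bookstore elements (an assumed variation of the sample data) shows the difference:

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<root>" +
            "<bookstore><book id='1'/><book id='2'/></bookstore>" +
            "<bookstore><book id='3'/><book id='4'/></bookstore>" +
            "</root>");

        // The predicate is evaluated per parent: books 2 and 4 are selected.
        Console.WriteLine(doc.SelectNodes("//book[last()]").Count);   // 2

        // Parentheses build the full node set first, then take its last member.
        XmlNodeList lastOverall = doc.SelectNodes("(//book)[last()]");
        Console.WriteLine(lastOverall.Count);                         // 1
        Console.WriteLine(lastOverall[0].Attributes["id"].Value);     // 4
    }
}
```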

8. Select the node at a specific position in a set: position()

Select the second book node (XPath positions start at 1):

```csharp
XmlNode node = doc.SelectSingleNode("//book[position()=2]");
Console.WriteLine(node.InnerXml); // Output: <title>Book 2</title>
```
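
A few further position-based forms, sketched on a three-book variation of the sample data (the third book is an assumption added for this example): a bare number is shorthand for position()=n, and position() can be compared with ranges or with last().

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<root><bookstore>" +
            "<book id='1'><title>Book 1</title></book>" +
            "<book id='2'><title>Book 2</title></book>" +
            "<book id='3'><title>Book 3</title></book>" +
            "</bookstore></root>");

        // [1] is shorthand for [position()=1]; positions are 1-based.
        Console.WriteLine(doc.SelectSingleNode("//book[1]/title").InnerText);  // Book 1

        // A range: the first two books.
        Console.WriteLine(doc.SelectNodes("//book[position() <= 2]").Count);   // 2

        // Relative to the end: the second-to-last book.
        Console.WriteLine(doc.SelectSingleNode("//book[position() = last() - 1]/title").InnerText); // Book 2
    }
}
```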

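9. Combine conditions with logical operators: and, or

Select the book node whose id is 1 and whose title is 'Book 1':

```csharp
XmlNode node = doc.SelectSingleNode("//book[@id='1' and title='Book 1']");
Console.WriteLine(node.InnerXml); // Output: <title>Book 1</title>
```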
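
The snippet above only exercises and. As a sketch of the other operators, here is the same idea with or and with the not() function, on a three-book variation of the sample data (the third book is an assumption added for this example):

```csharp
using System;
using System.Xml;

class Program
{
    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<root><bookstore>" +
            "<book id='1'><title>Book 1</title></book>" +
            "<book id='2'><title>Book 2</title></book>" +
            "<book id='3'><title>Book 3</title></book>" +
            "</bookstore></root>");

        // and: both conditions must hold.
        Console.WriteLine(doc.SelectNodes("//book[@id='1' and title='Book 1']").Count); // 1

        // or: either condition is enough.
        Console.WriteLine(doc.SelectNodes("//book[@id='1' or @id='3']").Count);         // 2

        // not(): negates a condition.
        Console.WriteLine(doc.SelectNodes("//book[not(@id='1')]").Count);               // 2
    }
}
```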

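Summary

/ and //: / steps through direct children, giving an absolute path when it starts the expression; // matches nodes at any depth.
./ and .//: search from the current node, among its direct children or among all of its descendants.
@, *, [predicate], last(), and the other special symbols can be combined freely to filter and locate nodes.
In C#, XmlDocument with its SelectNodes and SelectSingleNode methods is the main tool for running XPath queries against XML; HtmlAgilityPack's SelectNodes and SelectSingleNode accept the same XPath syntax for HTML.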