C#: Extract Hyperlinks from Word Documents

Word files such as technical documents and product manuals often contain many hyperlinks, and extracting them by hand is slow and prone to omissions and errors. This article presents an automated approach in C# that parses document elements to extract each hyperlink's anchor text, URL, and screen tip. The extracted data can then support data analysis, SEO, and other applications. The following sections show how to use Spire.Doc for .NET to extract hyperlinks from a Word document in a .NET program.

Install Spire.Doc for .NET

To begin, add the DLL files included in the Spire.Doc for .NET package as references in your .NET project. The DLL files can be downloaded from the product's download page or installed via NuGet.

Package Manager

PM> Install-Package Spire.Doc

Extracting All Hyperlinks from a Word Document Using C#

In a Word document, hyperlinks are stored as fields. To extract them, the first step is to identify all field objects by checking whether each document object is an instance of the Field class. Then, by checking whether the field object's Type property equals FieldType.FieldHyperlink, we can extract all hyperlink fields.

Once the hyperlinks are identified, we can use the Field.FieldText property to retrieve the hyperlink anchor text and the Field.GetFieldCode() method to obtain the full field code in the following format:

Hyperlink Type	Field Code Example
Standard Hyperlink	HYPERLINK "https://www.example.com/example"
Hyperlink with ScreenTip	HYPERLINK "https://www.example.com/example" \o "ScreenTip"

By parsing the field code, we can extract both the hyperlink URL and the screen tip text, enabling complete retrieval of hyperlink information.
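
The parsing step can be tried on its own, without loading a document. The following is a minimal sketch that assumes field codes follow the two formats shown in the table above. The HyperlinkFieldCode class and its method names are hypothetical helpers introduced here for illustration; they are not part of Spire.Doc.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Spire.Doc): parses the string
// returned by Field.GetFieldCode() for a hyperlink field.
public static class HyperlinkFieldCode
{
    // Assumption: the first quoted string after HYPERLINK is the URL.
    private static readonly Regex UrlPattern =
        new Regex("HYPERLINK\\s+\"([^\"]*)\"");

    // Assumption: the quoted string that follows the \o switch is the screen tip.
    private static readonly Regex ScreenTipPattern =
        new Regex("\\\\o\\s+\"([^\"]*)\"");

    // Returns the URL, or an empty string if the pattern is not found.
    public static string GetUrl(string fieldCode)
    {
        Match match = UrlPattern.Match(fieldCode ?? "");
        return match.Success ? match.Groups[1].Value : "";
    }

    // Returns the screen tip, or an empty string if there is no \o switch.
    public static string GetScreenTip(string fieldCode)
    {
        Match match = ScreenTipPattern.Match(fieldCode ?? "");
        return match.Success ? match.Groups[1].Value : "";
    }
}
```

Splitting the field code on quotation marks and taking a fixed index assumes the screen tip is always the second quoted string. Matching the \o switch explicitly avoids that assumption, and returning an empty string instead of throwing keeps a batch run going when a field code has an unexpected shape.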

1. Create a Document object and use the Document.LoadFromFile() method to load the target Word document.
2. Iterate through all sections in the document using foreach (Section section in doc.Sections) to retrieve each section object.
3. For each section, iterate through its child objects using foreach (DocumentObject secObj in section.Body.ChildObjects) to access individual elements.
4. If a child object is of type Paragraph:
   - Iterate through the child objects within the paragraph using foreach (DocumentObject paraObj in paragraph.ChildObjects).
5. If a paragraph child object is of type Field and its Field.Type property value equals FieldType.FieldHyperlink, process the Field object.
6. For each Field object:
   - Extract the anchor text using the Field.FieldText property.
   - Retrieve the field code string using the Field.GetFieldCode() method.
7. Process the field code string:
   - Extract the URL enclosed in quotation marks after "HYPERLINK".
   - Check if the field code contains the \o parameter; if present, extract the screen tip text enclosed in double quotes.
8. Store the extracted hyperlinks and write them to an output file.

using System.Collections.Generic;
using System.IO;
using Spire.Doc;
using Spire.Doc.Documents;
using Spire.Doc.Fields;

namespace ExtractWordHyperlink
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create an instance of Document
            Document doc = new Document();
            // Load a Word document
            doc.LoadFromFile("Sample.docx");

            // Create a string list to store the hyperlink information
            List<string> hyperlinkInfoList = new List<string>();

            // Iterate through the sections in the document
            foreach (Section section in doc.Sections)
            {
                // Iterate through the child objects in the section
                foreach (DocumentObject secObj in section.Body.ChildObjects)
                {
                    // Check if the current document object is a Paragraph instance
                    if (secObj is Paragraph paragraph)
                    {
                        // Iterate through the child objects in the paragraph
                        foreach (DocumentObject paraObj in paragraph.ChildObjects)
                        {
                            // Check if the current child object is a field
                            if (paraObj is Field field && field.Type == FieldType.FieldHyperlink)
                            {
                                string hyperlinkInfo = "";
                                // Get the anchor text
                                string anchorText = field.FieldText;

                                // Get the field code
                                string fieldCode = field.GetFieldCode();
                                // Get the URL from the field code
                                string url = fieldCode.Split('"')[1];
                                // Check if there is a ScreenTip
                                if (fieldCode.Contains("\\o"))
                                {
                                    // Get the ScreenTip text
                                    string screenTip = fieldCode.Split('"')[3].Trim();
                                    // Consolidate the information
                                    hyperlinkInfo += $"Anchor Text: {anchorText}\nURL: {url}\nScreenTip: {screenTip}";
                                }
                                else
                                {
                                    hyperlinkInfo += $"Anchor Text: {anchorText}\nURL: {url}";
                                }
                                hyperlinkInfo += "\n";
                                // Append the hyperlink information to the list
                                hyperlinkInfoList.Add(hyperlinkInfo);

                            }
                        }
                    }
                }
            }

            // Make sure the output folder exists, then write the extracted hyperlink information to a text file
            Directory.CreateDirectory("output");
            File.WriteAllLines("output/ExtractedHyperlinks.txt", hyperlinkInfoList);

            doc.Close();
        }
    }
}
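
The program above writes each hyperlink as a free-form text block. If the results are intended for data analysis, a delimited file may be easier to load into other tools. The following is a minimal sketch of that idea, assuming the anchor text, URL, and screen tip have already been extracted. HyperlinkRecord and HyperlinkCsv are hypothetical names introduced for illustration and are not part of Spire.Doc.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical container for one extracted hyperlink (not a Spire.Doc type).
public sealed class HyperlinkRecord
{
    public string AnchorText { get; set; } = "";
    public string Url { get; set; } = "";
    public string ScreenTip { get; set; } = "";
}

// Hypothetical helper that writes records as comma-separated values.
public static class HyperlinkCsv
{
    // Wrap a value in quotes when it contains a comma, a quote, or a line break,
    // and double any embedded quotes.
    public static string Escape(string value)
    {
        value = value ?? "";
        bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + value.Replace("\"", "\"\"") + "\"" : value;
    }

    // Build one CSV line from a record.
    public static string ToLine(HyperlinkRecord record)
    {
        return string.Join(",",
            Escape(record.AnchorText), Escape(record.Url), Escape(record.ScreenTip));
    }

    // Create the target folder if needed, then write a header row and one line per record.
    public static void Write(string path, IEnumerable<HyperlinkRecord> records)
    {
        string directory = Path.GetDirectoryName(Path.GetFullPath(path));
        if (!string.IsNullOrEmpty(directory))
        {
            Directory.CreateDirectory(directory);
        }

        var lines = new List<string> { "AnchorText,Url,ScreenTip" };
        lines.AddRange(records.Select(ToLine));
        File.WriteAllLines(path, lines);
    }
}
```

To use it, the extraction loop would add a HyperlinkRecord to a list instead of a formatted string, and the final step would call HyperlinkCsv.Write with the output path and that list.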

Figure: Hyperlinks extracted from a Word document using C#

Apply for a Temporary License

To remove the evaluation message from the generated documents, or to lift the functional limitations, you can request a 30-day trial license.
