Pandas DataFrame read_xml() Method



The Python Pandas library provides the read_xml() method to read data from an XML document and convert it into a Pandas DataFrame object. It turns XML with a regular, repeating structure into rows and columns, so the data can be processed and analyzed like any other DataFrame.

XML (Extensible Markup Language) is a popular data exchange format, often used for hierarchical and structured data. With the read_xml() method, you can extract XML data into Pandas for further manipulation and analysis. This method also provides various options for handling complex XML structures and attributes.

Syntax

The syntax of the read_xml() method is as follows −

pandas.read_xml(path_or_buffer, *, xpath='./*', namespaces=None, elems_only=False, attrs_only=False, names=None, dtype=None, converters=None, parse_dates=None, encoding='utf-8', parser='lxml', stylesheet=None, iterparse=None, compression='infer', storage_options=None, dtype_backend=<no_default>)

Parameters

The Python Pandas read_xml() method accepts the following parameters −

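  • path_or_buffer: The file path, URL, or file-like object containing the XML data. Since pandas 2.1, passing a literal XML string directly is deprecated; wrap it in io.StringIO or io.BytesIO instead.

  • xpath: The XPath expression that selects the nodes to parse into rows. Default is "./*", which selects the children of the root element.

  • namespaces: A dictionary that maps prefixes to namespace URIs, for use in the XPath expression.

  • elems_only: If True, parses only the child elements at the specified XPath and ignores attributes.

  • attrs_only: If True, parses only the attributes at the specified XPath and ignores child elements.

  • names: A list of column names to use for the resulting DataFrame.

  • dtype: The data type for the whole DataFrame or, as a dictionary, for individual columns.

  • converters: A dictionary that maps column names to functions for converting the values in those columns.

  • parse_dates: Specifies which columns to parse as dates, for example a list of column names.

  • encoding: The encoding of the input file. Default is 'utf-8'.

  • parser: The XML parser to use. Options are 'lxml' (default), which requires the lxml package, and 'etree', which uses the Python standard library.

  • stylesheet: A path, URL, or file-like object containing an XSLT stylesheet used to transform the XML before parsing. This option requires the lxml parser.

  • iterparse: A dictionary whose key is the name of the repeating element and whose value is a list of descendant element or attribute names to retrieve. It is intended for very large XML files stored on disk.

  • compression: The compression of the XML file. Options include 'infer' (default, detected from the file extension), 'gzip', 'bz2', 'zip', etc.

  • storage_options: Additional options for remote storage connections.

  • dtype_backend: The backend for the data types of the resulting DataFrame, either 'numpy_nullable' or 'pyarrow'.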
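To make the difference between elems_only and attrs_only concrete, here is a minimal sketch. The sample XML and the variable names are made up for illustration, and parser='etree' is chosen only so that the snippet runs without the optional lxml package.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: each <book> has one attribute and two child elements
xml = """<bookstore>
  <book category="cooking"><title>Everyday Italian</title><year>2005</year></book>
  <book category="web"><title>Learning XML</title><year>2003</year></book>
</bookstore>"""

# Default: attributes and child elements both become columns
both = pd.read_xml(StringIO(xml), xpath=".//book", parser="etree")

# elems_only=True: only the child elements become columns
elems = pd.read_xml(StringIO(xml), xpath=".//book", elems_only=True, parser="etree")

# attrs_only=True: only the attributes become columns
attrs = pd.read_xml(StringIO(xml), xpath=".//book", attrs_only=True, parser="etree")

print(list(both.columns))
print(list(elems.columns))
print(list(attrs.columns))
```

A fresh StringIO object is created for each call because a file-like object is consumed once it has been read.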

Return Value

The read_xml() method returns a Pandas DataFrame containing the parsed data from the XML document.

Example: Reading a Simple XML File

Here is a basic example that demonstrates how to use the read_xml() method to read an XML file and convert it to a DataFrame.

import pandas as pd

# Create an XML Document first
# Sample DataFrame
df = pd.DataFrame({
    'name': ['Tanmay', 'Manisha'],
    'company': ['TutorialsPoint', 'TutorialsPoint'],
    'phone': ['(011) 123-4567', '(011) 789-4567']
})

# Save DataFrame to XML 
df.to_xml("simple_data.xml")

# Read XML data from a file
df = pd.read_xml('simple_data.xml')

print("DataFrame:")
print(df)

The output of the above code will be −

DataFrame:
   index     name         company           phone
0      0   Tanmay  TutorialsPoint  (011) 123-4567
1      1  Manisha  TutorialsPoint  (011) 789-4567
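
The extra index column in this output appears because to_xml() writes the DataFrame index by default. A minimal sketch of one way to avoid it, assuming you control the writing step, is to pass index=False to to_xml(). Here the XML is kept in memory instead of a file, and parser='etree' is used only to avoid the optional lxml dependency; both are choices made for this illustration.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({
    'name': ['Tanmay', 'Manisha'],
    'company': ['TutorialsPoint', 'TutorialsPoint'],
    'phone': ['(011) 123-4567', '(011) 789-4567']
})

# With no path given, to_xml() returns the XML document as a string;
# index=False leaves the DataFrame index out of the document
xml_text = df.to_xml(index=False, parser='etree')

# Read the XML back; only the three data columns are present
restored = pd.read_xml(StringIO(xml_text), parser='etree')

print(restored)
```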

Example: Reading an XML String

Instead of reading XML data from a local file, the following example reads a string containing XML data into a Pandas DataFrame. The string is wrapped in a StringIO object so that it can be passed to read_xml() as a file-like object.

import pandas as pd
from io import StringIO

# Create an XML string
xml_string = """<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <Name>Kiran</Name>
    <Age>25</Age>
    <City>New Delhi</City>
  </row>
  <row>
    <index>1</index>
    <Name>Priya</Name>
    <Age>30</Age>
    <City>Hyderabad</City>
  </row>
  <row>
    <index>2</index>
    <Name>Naveen</Name>
    <Age>35</Age>
    <City>Chennai</City>
  </row>
</data>
"""
# Read XML string 
df = pd.read_xml(StringIO(xml_string))

print("DataFrame from XML string:")
print(df)

Output of the above code is as follows −

DataFrame from XML string:
   index    Name  Age       City
0      0   Kiran   25  New Delhi
1      1   Priya   30  Hyderabad
2      2  Naveen   35    Chennai
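
By default, read_xml() infers the column types, as it did for the Age column above. The dtype and converters parameters give more control over this step. The following is a minimal sketch with made-up data: the id and price columns, and the conversion of prices to cents, are assumptions for illustration only, and parser='etree' is used to avoid the optional lxml dependency.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: ids with leading zeros and prices as decimal text
xml = """<data>
  <row><id>001</id><price>12.50</price></row>
  <row><id>002</id><price>7.25</price></row>
</data>"""

df = pd.read_xml(
    StringIO(xml),
    parser="etree",
    # Keep ids as text so the leading zeros are not lost
    dtype={"id": str},
    # Convert each price from decimal text to an integer number of cents
    converters={"price": lambda text: round(float(text) * 100)},
)

print(df)
```

Without dtype, the id values would be inferred as integers and the leading zeros would be dropped.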

Example: Reading XML Data with Custom XPath

This example demonstrates how to use a custom XPath query to extract specific elements from an XML document. The following example reads only the "title" elements from the XML data using the xpath parameter. Each matched element becomes one row, and its lang attribute and its text become the columns.

import pandas as pd
from io import StringIO

# Create an XML String
xml = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>"""

# Read XML data with a custom XPath
df = pd.read_xml(StringIO(xml), xpath=".//title")

# Display the output DataFrame
print("Output DataFrame with custom XPath:")
print(df)

The output of the above code is as follows −

Output DataFrame with custom XPath:
   lang             title
0    en  Everyday Italian
1    en      Harry Potter
2    en      Learning XML
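
When the XML document uses namespace prefixes, the XPath expression needs the namespaces parameter to resolve them. The following is a minimal sketch: the prefix b and the URI http://example.com/books are hypothetical, and parser='etree' is used only to avoid the optional lxml dependency.

```python
from io import StringIO

import pandas as pd

# Hypothetical document whose elements live in a prefixed namespace
xml = """<doc xmlns:b="http://example.com/books">
  <b:book><b:title>Everyday Italian</b:title><b:year>2005</b:year></b:book>
  <b:book><b:title>Learning XML</b:title><b:year>2003</b:year></b:book>
</doc>"""

# Map the prefix used in the XPath expression to the namespace URI
df = pd.read_xml(
    StringIO(xml),
    xpath=".//b:book",
    namespaces={"b": "http://example.com/books"},
    parser="etree",
)

print(df)
```

The prefix in the namespaces dictionary only has to match the one used in the xpath argument; what identifies the namespace is the URI.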

Example: Reading Compressed XML Files

The compression parameter allows reading compressed XML files. The following example shows how to read a compressed XML file using the compression parameter.

import pandas as pd

# Create an XML Document first
# Sample DataFrame
df = pd.DataFrame({
    'name': ['Tanmay', 'Manisha'],
    'company': ['TutorialsPoint', 'TutorialsPoint'],
    'phone': ['(011) 123-4567', '(011) 789-4567']
})

# Save DataFrame to compressed XML 
df.to_xml("compressed_data.xml.gz", compression='gzip')

# Read a compressed XML file
df = pd.read_xml('compressed_data.xml.gz', compression='gzip')

print("DataFrame from compressed XML:")
print(df)

Output of the above code is as follows −

DataFrame from compressed XML:
   index     name         company           phone
0      0   Tanmay  TutorialsPoint  (011) 123-4567
1      1  Manisha  TutorialsPoint  (011) 789-4567
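
Because the default value of compression is 'infer', the parameter can usually be omitted when the file name ends with a recognized extension such as .gz. The following is a minimal sketch of that behaviour. The temporary directory, the file name data.xml.gz, and the score column are assumptions for illustration, and parser='etree' is used only to avoid the optional lxml dependency.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'name': ['Tanmay', 'Manisha'], 'score': [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.xml.gz')

    # No compression argument: gzip is inferred from the .gz extension
    df.to_xml(path, index=False, parser='etree')

    # Check the first two bytes to confirm that the file really is gzip data
    with open(path, 'rb') as handle:
        magic = handle.read(2)

    # Reading also infers the compression from the extension
    restored = pd.read_xml(path, parser='etree')

print(magic == b'\x1f\x8b')
print(restored)
```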

Example: Parsing XML with Custom Date Parsing

This example uses the pandas.read_xml() method to handle nullable types and date parsing using the dtype_backend and parse_dates parameters respectively.

import pandas as pd
from io import StringIO

# XML string with timestamps
xml_content = '''<data>
    <record>
        <id>1</id>
        <value>3.14</value>
        <flag>True</flag>
        <label>X</label>
        <timestamp>2025-01-01 12:00:00</timestamp>
    </record>
    <record>
        <id>2</id>
        <value>6.28</value>
        <flag>False</flag>
        <label>Y</label>
        <timestamp>2025-01-02 12:00:00</timestamp>
    </record>
</data>
'''

# Parsing the XML data
df = pd.read_xml(StringIO(xml_content), 
                 dtype_backend="numpy_nullable", 
                 parse_dates=["timestamp"])

# Display the output DataFrame
print("Output DataFrame:")
print(df)

The output of the above code is as follows −

Output DataFrame:
   id  value   flag label           timestamp
0   1   3.14   True     X 2025-01-01 12:00:00
1   2   6.28  False     Y 2025-01-02 12:00:00
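
The effect of dtype_backend is easiest to see when a value is missing. The following is a minimal sketch with made-up data in which the second record has an empty id element. It compares the inferred column types with and without dtype_backend="numpy_nullable"; parser='etree' is used only to avoid the optional lxml dependency.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: the second record has an empty <id> element
xml = """<data>
  <record><id>1</id><value>3.14</value></record>
  <record><id></id><value>6.28</value></record>
</data>"""

# Default backend: the missing integer forces the column to float64
default_df = pd.read_xml(StringIO(xml), parser="etree")

# Nullable backend: the column stays an integer type and holds <NA>
nullable_df = pd.read_xml(
    StringIO(xml), parser="etree", dtype_backend="numpy_nullable"
)

print(default_df.dtypes)
print(nullable_df.dtypes)
```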