Loading XML into MongoDB

There are many situations where you may need to load data from XML into MongoDB.

Although XML and the JSON-like BSON format used in MongoDB have much in common, they also differ in ways that make them non-interchangeable.
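To make the mismatch concrete, here is a minimal sketch using only Python's standard library. The XML fragment and its tag names are invented for illustration. The point is that attributes, repeated elements, and untyped text have no single obvious JSON mapping, so every field requires a decision.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical XML fragment, invented for this illustration.
XML = """
<order id="42">
  <item>pen</item>
  <item>ink</item>
  <total>9.50</total>
</order>
"""

root = ET.fromstring(XML)

# Each entry below is a mapping decision that XML itself leaves open.
doc = {
    "id": int(root.get("id")),                         # attribute -> typed field
    "items": [e.text for e in root.findall("item")],   # repeated tag -> array
    "total": float(root.findtext("total")),            # text -> number
}

print(json.dumps(doc))  # {"id": 42, "items": ["pen", "ink"], "total": 9.5}
```

A different author could just as reasonably keep `id` as a string or nest the items under another key, which is why the two formats cannot be converted mechanically.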

Therefore, to load data from XML into MongoDB, you will need to do one of the following:

  1. Write your own XML parsing scripts;
  2. Use ETL tools.

Although modern language models can write parsing scripts quite well in languages like Python, these scripts have a serious problem: they are not unified. A language model will generate a separate script for each file type, so if you have more than one type of XML, you end up maintaining several parsing scripts.
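The maintenance problem can be sketched as follows. Both XML layouts and all tag names here are hypothetical. Each layout needs its own function, and every new layout adds another entry to the dispatch table.

```python
import xml.etree.ElementTree as ET

# Two hypothetical XML layouts that describe the same kind of record.
CUSTOMER_V1 = "<customer><name>Ann</name><city>Oslo</city></customer>"
CUSTOMER_V2 = '<client fullName="Ann"><address><town>Oslo</town></address></client>'


def parse_v1(xml_text: str) -> dict:
    """Parser tied to the <customer> layout."""
    root = ET.fromstring(xml_text)
    return {"name": root.findtext("name"), "city": root.findtext("city")}


def parse_v2(xml_text: str) -> dict:
    """Parser tied to the <client> layout; shares no logic with parse_v1."""
    root = ET.fromstring(xml_text)
    return {"name": root.get("fullName"), "city": root.findtext("address/town")}


# Dispatch by root tag: this table grows with every new file type.
PARSERS = {"customer": parse_v1, "client": parse_v2}


def parse_any(xml_text: str) -> dict:
    """Pick the parser that matches the document's root tag."""
    root_tag = ET.fromstring(xml_text).tag
    return PARSERS[root_tag](xml_text)
```

Both inputs produce the same document, but only because two separate parsers were written and kept in sync by hand.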

This problem is usually solved with specialized ETL tools. In this article, we will look at an ETL tool called SmartXML. Although SmartXML also supports converting XML to a relational representation, we will only look at the process of loading XML into MongoDB.

Real-world XML can be extremely large and complex. Since this is an introductory article, we will look at a situation in which:

  1. All XML has the same structure;
  2. The logical model of the XML is the same as the storage model in MongoDB;
  3. Extracted fields don’t need complex processing.

We’ll cover cases that break these assumptions later, but first, let’s examine a simple example:

XML

 

In this example, we will load into MongoDB only the fields that serve a practical purpose, rather than the entire XML.

Create a New Project 

It is recommended to create a new project from the GUI. This will automatically create the necessary folder structure and parsing rules. A full description of the project structure can be found in the official documentation.

All parameters described in this article can be configured in graphical mode, but for clarity, we will focus on the textual representation.

In addition to the config.txt file with project settings and the job.txt file for batch work, the project consists of:

  1. A template of the intermediate internal SmartDOM view, located at templates/data-templates.red in the project folder.
  2. Rules for processing and transforming SmartDOM, located in the rules folder.

Let’s consider the structure of data-templates.red:

Plain Text

 

Note

  1. sample is the name of the category; the specific name doesn’t matter.
  2. marketing_data is the name of the subcategory. At least one subcategory (subtype) is required.
  3. The intermediate view names don’t have to match the XML tag names exactly. In this example, we intentionally used snake_case.

Extract Rules

The rules are located in the rules directory in the project folder.

When working with MongoDB, we are only interested in two rule files:

  1. tags-matching-rules.red, which sets the matches between the XML tag tree and SmartDOM.
  2. grow-rules.red, which describes the relationship between SmartDOM nodes and real XML nodes.

Plain Text

 

The key will be the name of the node in SmartDOM; the value will be an array containing the node spelling variants from the real XML file. In our example, these names are the same.
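The idea behind such a mapping can be illustrated in plain Python. This is not SmartXML's rule syntax, which uses its own format. The dictionary, the tag names, and the `canonical_name` helper are all hypothetical.

```python
# Key: canonical node name; value: spellings that may occur in real XML.
# All names are hypothetical and only illustrate the key -> variants idea.
TAG_VARIANTS = {
    "customer_name": ["CustomerName", "customerName", "customer_name"],
    "order_total": ["OrderTotal", "Total"],
}

# Invert the mapping once so each XML tag resolves in a single lookup.
TAG_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in TAG_VARIANTS.items()
    for variant in variants
}


def canonical_name(xml_tag: str):
    """Return the canonical node name for an XML tag, or None if unmapped."""
    return TAG_TO_CANONICAL.get(xml_tag)
```

With this table, `canonical_name("Total")` and `canonical_name("OrderTotal")` both resolve to `order_total`, so files that spell the same tag differently end up in the same field.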

Ignored Tags

To avoid loading nonessential data into MongoDB in the example above, we create files in the ignores folder, one per section, each named after its section. These files list the tags to skip during extraction. For our example, we’ll have a sample.txt file containing:

Plain Text

 
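Conceptually, an ignore list works like the filter below. This is a plain-Python illustration rather than SmartXML's implementation. The sample XML, the tag names, and the `extract_fields` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical ignore list, one tag per line, as it might appear in a text file.
IGNORE_FILE_CONTENT = """\
InternalNotes
Checksum
"""

# Hypothetical input document.
XML = """
<record>
  <Title>Report</Title>
  <InternalNotes>draft</InternalNotes>
  <Checksum>ab12</Checksum>
</record>
"""


def extract_fields(xml_text: str, ignored: set) -> dict:
    """Collect direct child elements, skipping any tag in the ignore set."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root if child.tag not in ignored}


# Parse the ignore list, dropping blank lines.
ignored_tags = {
    line.strip() for line in IGNORE_FILE_CONTENT.splitlines() if line.strip()
}
fields = extract_fields(XML, ignored_tags)
print(fields)  # {'Title': 'Report'}
```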

As a result, after morphological analysis, the intermediate representation takes the following form:

Plain Text

 

Note that after morphological analysis, only a minimal representation is shown, containing data from the first nodes found.

Here’s the JSON file that will be generated:

JSON

 

Configuring Connection to MongoDB

Since MongoDB doesn’t support direct HTTP data insertion, an intermediary service will be required.

Let’s install the dependencies: pip install flask pymongo.

The service itself:

Python

 
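As a rough idea of what such an intermediary might look like, here is a minimal sketch built on Flask and pymongo. It is not the original service from this article. The `/insert` path, the database and collection names, the host and port, and the response shape are all assumptions, and the payload format SmartXML actually sends may differ.

```python
from flask import Flask, jsonify, request


def create_app(collection):
    """Build the app around any object that offers pymongo's insert_many()."""
    app = Flask(__name__)

    @app.route("/insert", methods=["POST"])  # hypothetical endpoint path
    def insert():
        payload = request.get_json(silent=True)
        if payload is None:
            return jsonify({"error": "request body must be JSON"}), 400
        # Accept either a single document or a list of documents.
        docs = payload if isinstance(payload, list) else [payload]
        if not docs or not all(isinstance(d, dict) for d in docs):
            return jsonify({"error": "expected an object or a non-empty list of objects"}), 400
        result = collection.insert_many(docs)
        return jsonify({"inserted": len(result.inserted_ids)}), 201

    return app


def main():
    """Connect to a local MongoDB and serve; every name here is a placeholder."""
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    collection = client["sample_db"]["sample_collection"]
    create_app(collection).run(host="127.0.0.1", port=5000)
```

Calling `main()` starts the server. Passing the collection into `create_app` keeps the HTTP handling separate from the database connection, which makes the handler easy to exercise with Flask's test client and a stand-in collection.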

We’ll set up the MongoDB connection settings in the config.txt file (see nosql-url):

Plain Text

 

Remember that MongoDB automatically creates a database and a collection with the given names on first insert if they do not exist. This behavior can hide mistakes such as a misspelled name, so it is recommended to guard against it.
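One way to guard against accidental creation on the service side is to check that the collection already exists before inserting. The sketch below assumes a pymongo `Database` object (or anything with the same `list_collection_names()` method and item access); `require_collection` is a hypothetical helper, not part of pymongo or SmartXML.

```python
def require_collection(db, name: str):
    """Return db[name], but raise if the collection does not exist yet.

    `db` is expected to behave like a pymongo Database:
    it must offer list_collection_names() and item access.
    """
    if name not in db.list_collection_names():
        raise LookupError(f"collection {name!r} does not exist")
    return db[name]
```

With a real connection this would be called as `require_collection(client["sample_db"], "sample_collection")`, where both names are placeholders. A misspelled collection name then fails loudly instead of silently creating a new, empty collection.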

Let’s run the service itself:

Python

 

Next, click Parse, then Send JSON to NoSQL.

Now connect to the MongoDB console in any convenient way and execute the following commands:

Plain Text

 

The result should look like the following:

JSON

 

Conclusion

In this example, we have seen how to automate loading XML files into MongoDB without writing any parsing code. Although the example considers only one file, a single project can handle a large number of file types and subtypes with different structures, and can perform fairly complex manipulations such as type conversion and calling external services to process field values in real time. This makes it possible not only to load data from XML but also to process some of the values through an external API, including large language models.

Source:
https://dzone.com/articles/loading-xml-into-mongodb