有许多情况下您可能需要将数据从XML导出到MongoDB。
尽管在MongoDB中使用的XML和JSON(B)格式有很多共同之处,但它们也有一些不同之处,使它们无法互换。
因此,在面临将数据从XML导出到MongoDB的任务之前,您需要:
- 编写自己的XML解析脚本;
- 使用ETL工具。
尽管现代语言模型可以很好地用诸如Python等语言编写解析脚本,但这些脚本会存在一个严重问题 —— 它们不会统一。对于每种文件类型,现代语言模型将生成单独的脚本。如果您有多种类型的XML,这将在维护多个解析脚本时带来重大问题。
上述问题通常使用专门的ETL工具来解决。在本文中,我们将介绍一种名为SmartXML的ETL工具。虽然SmartXML也支持将XML转换为关系表示,但我们将只关注将XML上传到MongoDB的过程。
实际的XML可能非常庞大和复杂。本文是一篇入门文章,因此我们将解析一个情况,即:
- 所有XML具有相同的结构;
- XML的逻辑模型与MongoDB中的存储模型相同;
- 提取的字段不需要复杂的处理;
我们稍后会讨论这些情况,但首先让我们看看一个简单的例子:
<marketingData>
<customer>
<name>John Smith</name>
<email>[email protected]</email>
<purchases>
<purchase>
<product>Smartphone</product>
<category>Electronics</category>
<price>700</price>
<store>TechWorld</store>
<location>New York</location>
<purchaseDate>2025-01-10</purchaseDate>
</purchase>
<purchase>
<product>Wireless Earbuds</product>
<category>Audio</category>
<price>150</price>
<store>GadgetStore</store>
<location>New York</location>
<purchaseDate>2025-01-11</purchaseDate>
</purchase>
</purchases>
<importantInfo>
<loyaltyStatus>Gold</loyaltyStatus>
<age>34</age>
<gender>Male</gender>
<membershipID>123456</membershipID>
</importantInfo>
<lessImportantInfo>
<browser>Chrome</browser>
<deviceType>Mobile</deviceType>
<newsletterSubscribed>true</newsletterSubscribed>
</lessImportantInfo>
</customer>
<customer>
<name>Jane Doe</name>
<email>[email protected]</email>
<purchases>
<purchase>
<product>Laptop</product>
<category>Electronics</category>
<price>1200</price>
<store>GadgetStore</store>
<location>San Francisco</location>
<purchaseDate>2025-01-12</purchaseDate>
</purchase>
<purchase>
<product>USB-C Adapter</product>
<category>Accessories</category>
<price>30</price>
<store>TechWorld</store>
<location>San Francisco</location>
<purchaseDate>2025-01-13</purchaseDate>
</purchase>
<purchase>
<product>Keyboard</product>
<category>Accessories</category>
<price>80</price>
<store>OfficeMart</store>
<location>San Francisco</location>
<purchaseDate>2025-01-14</purchaseDate>
</purchase>
</purchases>
<importantInfo>
<loyaltyStatus>Silver</loyaltyStatus>
<age>28</age>
<gender>Female</gender>
<membershipID>654321</membershipID>
</importantInfo>
<lessImportantInfo>
<browser>Safari</browser>
<deviceType>Desktop</deviceType>
<newsletterSubscribed>false</newsletterSubscribed>
</lessImportantInfo>
</customer>
<customer>
<name>Michael Johnson</name>
<email>[email protected]</email>
<purchases>
<purchase>
<product>Headphones</product>
<category>Audio</category>
<price>150</price>
<store>AudioZone</store>
<location>Chicago</location>
<purchaseDate>2025-01-05</purchaseDate>
</purchase>
</purchases>
<importantInfo>
<loyaltyStatus>Bronze</loyaltyStatus>
<age>40</age>
<gender>Male</gender>
<membershipID>789012</membershipID>
</importantInfo>
<lessImportantInfo>
<browser>Firefox</browser>
<deviceType>Tablet</deviceType>
<newsletterSubscribed>true</newsletterSubscribed>
</lessImportantInfo>
</customer>
<customer>
<name>Emily Davis</name>
<email>[email protected]</email>
<purchases>
<purchase>
<product>Running Shoes</product>
<category>Sportswear</category>
<price>120</price>
<store>FitShop</store>
<location>Los Angeles</location>
<purchaseDate>2025-01-08</purchaseDate>
</purchase>
<purchase>
<product>Yoga Mat</product>
<category>Sportswear</category>
<price>40</price>
<store>FitShop</store>
<location>Los Angeles</location>
<purchaseDate>2025-01-09</purchaseDate>
</purchase>
</purchases>
<importantInfo>
<loyaltyStatus>Gold</loyaltyStatus>
<age>25</age>
<gender>Female</gender>
<membershipID>234567</membershipID>
</importantInfo>
<lessImportantInfo>
<browser>Edge</browser>
<deviceType>Mobile</deviceType>
<newsletterSubscribed>false</newsletterSubscribed>
</lessImportantInfo>
</customer>
<customer>
<name>Robert Brown</name>
<email>[email protected]</email>
<purchases>
<purchase>
<product>Smartwatch</product>
<category>Wearable</category>
<price>250</price>
<store>GadgetPlanet</store>
<location>Boston</location>
<purchaseDate>2025-01-07</purchaseDate>
</purchase>
<purchase>
<product>Fitness Band</product>
<category>Wearable</category>
<price>100</price>
<store>HealthMart</store>
<location>Boston</location>
<purchaseDate>2025-01-08</purchaseDate>
</purchase>
</purchases>
<importantInfo>
<loyaltyStatus>Silver</loyaltyStatus>
<age>37</age>
<gender>Male</gender>
<membershipID>345678</membershipID>
</importantInfo>
<lessImportantInfo>
<browser>Chrome</browser>
<deviceType>Mobile</deviceType>
<newsletterSubscribed>true</newsletterSubscribed>
</lessImportantInfo>
</customer>
</marketingData>
在这个例子中,我们只会在MongoDB中上传具有实际用途的字段,而不是整个XML。
创建新项目
建议通过GUI创建新项目。这将自动创建必要的文件夹结构和解析规则。项目结构的完整描述可以在官方文档中找到。
本文中描述的所有参数都可以在图形模式下进行配置,但为了清晰起见,我们将重点关注文本表示。
除了包含项目设置的config.txt
文件和用于批处理的job.txt之外,项目本身还包括:
- 位于项目文件夹
templates/data-templates.red
中的中间内部SmartDOM
视图模板。. - 处理和转换
SmartDOM
本身的规则,位于rules
文件夹中。
让我们考虑一下data-templates.red
的结构:
#[
sample: #[
marketing_data: #[
customers: [
customer: [
name: none
email: none
purchases: [
purchase: [
product: none
category: none
price: none
store: none
location: none
purchase_date: none
]
]
]
]
]
]
]
注
- 名称
sample
是类别的名称,具体内容并不重要。 marketing_data
是子类别的名称。我们至少需要一个代码子类别(子类型)。- 中间视图名称不需要与 XML 标签名称完全匹配。在这个例子中,我们故意使用了
snake_case
风格。
提取规则
这些规则位于项目文件夹中的 rules
目录中。
在处理 MongoDB 时,我们只关心两个规则:
tags-matching-rules.red
— 设置 XML 标签树与 SmartDOM 之间的匹配grow-rules.red
— 描述 SmartDOM 节点与实际 XML 节点之间的关系
sample: [
purchase: ["purchase"]
customer: ["customer"]
]
键将是 SmartDOM 中节点的名称;值将是包含来自实际 XML 文件的节点拼写变体的数组。在我们的例子中,这些名称是相同的。
忽略的标签
为了避免在上面的示例中将次要数据加载到 MongoDB 中,我们在 ignores
文件夹中创建文件 — 每个部分一个文件,以每个部分命名。这些文件包含在提取过程中要跳过的标签列表。对于我们的示例,我们将有一个包含以下内容的 sample.txt
文件:
["marketingData" "customer" "lessImportantInfo" "browser"]
["marketingData" "customer" "lessImportantInfo" "deviceType"]
["marketingData" "customer" "lessImportantInfo" "newsletterSubscribed"]
因此,在分析形态时,中间表示将采用以下形式:
customers: [
customer: [
name: "John Smith"
email: "[email protected]"
loyalty_status: "Gold"
age: "34"
gender: "Male"
membership_id: "123456"
purchases: [
purchase: [
product: "Smartphone"
category: "Electronics"
price: "700"
store: "TechWorld"
location: "New York"
purchase_date: "2025-01-10"
]
]
]
]
请注意,在形态分析后,只显示包含来自第一个找到的节点的数据的最小表示。
以下是将生成的 JSON 文件:
{
"customers": [
{
"name": "John Smith",
"email": "[email protected]",
"loyalty_status": "Gold",
"age": "34",
"gender": "Male",
"membership_id": "123456",
"purchases": [
{
"product": "Smartphone",
"category": "Electronics",
"price": "700",
"store": "TechWorld",
"location": "New York",
"purchase_date": "2025-01-10"
},
{
"product": "Wireless Earbuds",
"category": "Audio",
"price": "150",
"store": "GadgetStore",
"location": "New York",
"purchase_date": "2025-01-11"
}
]
},
{
"name": "Jane Doe",
"email": "[email protected]",
"loyalty_status": "Silver",
"age": "28",
"gender": "Female",
"membership_id": "654321",
"purchases": [
{
"product": "Laptop",
"category": "Electronics",
"price": "1200",
"store": "GadgetStore",
"location": "San Francisco",
"purchase_date": "2025-01-12"
},
{
"product": "USB-C Adapter",
"category": "Accessories",
"price": "30",
"store": "TechWorld",
"location": "San Francisco",
"purchase_date": "2025-01-13"
},
{
"product": "Keyboard",
"category": "Accessories",
"price": "80",
"store": "OfficeMart",
"location": "San Francisco",
"purchase_date": "2025-01-14"
}
]
},
{
"name": "Michael Johnson",
"email": "[email protected]",
"loyalty_status": "Bronze",
"age": "40",
"gender": "Male",
"membership_id": "789012",
"purchases": [
{
"product": "Headphones",
"category": "Audio",
"price": "150",
"store": "AudioZone",
"location": "Chicago",
"purchase_date": "2025-01-05"
}
]
},
{
"name": "Emily Davis",
"email": "[email protected]",
"loyalty_status": "Gold",
"age": "25",
"gender": "Female",
"membership_id": "234567",
"purchases": [
{
"product": "Running Shoes",
"category": "Sportswear",
"price": "120",
"store": "FitShop",
"location": "Los Angeles",
"purchase_date": "2025-01-08"
},
{
"product": "Yoga Mat",
"category": "Sportswear",
"price": "40",
"store": "FitShop",
"location": "Los Angeles",
"purchase_date": "2025-01-09"
}
]
},
{
"name": "Robert Brown",
"email": "[email protected]",
"loyalty_status": "Silver",
"age": "37",
"gender": "Male",
"membership_id": "345678",
"purchases": [
{
"product": "Smartwatch",
"category": "Wearable",
"price": "250",
"store": "GadgetPlanet",
"location": "Boston",
"purchase_date": "2025-01-07"
},
{
"product": "Fitness Band",
"category": "Wearable",
"price": "100",
"store": "HealthMart",
"location": "Boston",
"purchase_date": "2025-01-08"
}
]
}
]
}
配置连接到 MongoDB
由于 MongoDB 不支持直接的 HTTP 数据插入,因此将需要一个中间服务。
让我们安装依赖项:pip install flask pymongo
。
服务本身:
from flask import Flask, request, jsonify
from pymongo import MongoClient
import json
app = Flask(__name__)
# Connection to MongoDB
client = MongoClient('mongodb://localhost:27017')
db = client['testDB']
collection = db['testCollection']
route('/insert', methods=['POST']) .
def insert_document():
try:
# Flask will automatically parse JSON if Content-Type: application/json
data = request.get_json()
if not data:
return jsonify({"error": "Empty JSON payload"}), 400
result = collection.insert_one(data)
return jsonify({"insertedId": str(result.inserted_id)}), 200
except Exception as e:
import traceback
print(traceback.format_exc())
return jsonify({"error": str(e)}), 500
if __name__ == '__main__':
app.run(port=3000)
我们将在config.txt
文件中设置MongoDB连接设置(参见nosql-url
):
job-number: 1
root-xml-folder: "D:/data/data-samples"
xml-filling-stat: false ; table: filling_percent_stat should exists
ignore-namespaces: false
ignore-tag-attributes: false
use-same-morphology-for-same-file-name-pattern: false
skip-schema-version-tag: true
use-same-morphology-for-all-files-in-folder: false
delete-data-before-insert: none
connect-to-db-at-project-opening: true
source-database: "SQLite" ; available values: PostgreSQL/SQLite
target-database: "SQLite" ; available values: PostgreSQL/SQLite/NoSQL
bot-chatID: ""
bot-token: ""
telegram-notifications: true
db-driver: ""
db-server: "127.0.0.1"
db-port: ""
db-name: ""
db-user: ""
db-pass: ""
sqlite-driver-name: "SQLite3 ODBC Driver"
sqlite-db-path: ""
nosql-url: "http://127.0.0.1:3000/insert"
append-subsection-name-to-nosql-url: false
no-sql-login: "" ; login and pass are empty
no-sql-pass: ""
请记住,如果数据库和集合不存在,MongoDB会自动创建同名的数据库和集合。然而,这种行为可能会导致错误,建议默认禁用它。
让我们运行服务本身:
python .\app.py
接下来,点击解析,然后发送JSON到NoSQL。
现在以任何方便的方式连接到MongoDB控制台,并执行以下命令:
show databases
admin 40.00 KiB
config 72.00 KiB
local 72.00 KiB
testDB 72.00 KiB
use testDB
switched to db testDB
db.testCollection.find().pretty()
结果应该如下所示:
{
_id: ObjectId('278e1b2c7c1823d4fde120ef'),
customers: [
{
name: 'John Smith',
email: '[email protected]',
loyalty_status: 'Gold',
age: '34',
gender: 'Male',
membership_id: '123456',
purchases: [
{
product: 'Smartphone',
category: 'Electronics',
price: '700',
store: 'TechWorld',
location: 'New York',
purchase_date: '2025-01-10'
},
{
product: 'Wireless Earbuds',
category: 'Audio',
price: '150',
store: 'GadgetStore',
location: 'New York',
purchase_date: '2025-01-11'
}
]
},
{
name: 'Jane Doe',
email: '[email protected]',
loyalty_status: 'Silver',
age: '28',
gender: 'Female',
membership_id: '654321',
purchases: [
{
product: 'Laptop',
category: 'Electronics',
price: '1200',
store: 'GadgetStore',
location: 'San Francisco',
purchase_date: '2025-01-12'
},
{
product: 'USB-C Adapter',
category: 'Accessories',
price: '30',
store: 'TechWorld',
location: 'San Francisco',
purchase_date: '2025-01-13'
},
{
product: 'Keyboard',
category: 'Accessories',
price: '80',
store: 'OfficeMart',
location: 'San Francisco',
purchase_date: '2025-01-14'
}
]
},
{
name: 'Michael Johnson',
email: '[email protected]',
loyalty_status: 'Bronze',
age: '40',
gender: 'Male',
membership_id: '789012',
purchases: [
{
product: 'Headphones',
category: 'Audio',
price: '150',
store: 'AudioZone',
location: 'Chicago',
purchase_date: '2025-01-05'
}
]
},
{
name: 'Emily Davis',
email: '[email protected]',
loyalty_status: 'Gold',
age: '25',
gender: 'Female',
membership_id: '234567',
purchases: [
{
product: 'Running Shoes',
category: 'Sportswear',
price: '120',
store: 'FitShop',
location: 'Los Angeles',
purchase_date: '2025-01-08'
},
{
product: 'Yoga Mat',
category: 'Sportswear',
price: '40',
store: 'FitShop',
location: 'Los Angeles',
purchase_date: '2025-01-09'
}
]
},
{
name: 'Robert Brown',
email: '[email protected]',
loyalty_status: 'Silver',
age: '37',
gender: 'Male',
membership_id: '345678',
purchases: [
{
product: 'Smartwatch',
category: 'Wearable',
price: '250',
store: 'GadgetPlanet',
location: 'Boston',
purchase_date: '2025-01-07'
},
{
product: 'Fitness Band',
category: 'Wearable',
price: '100',
store: 'HealthMart',
location: 'Boston',
purchase_date: '2025-01-08'
}
]
}
]
}
结论
在这个例子中,我们看到了如何在不编写任何代码的情况下自动将XML文件上传到MongoDB。 虽然这个例子只考虑了一个文件,但在一个项目的框架内,可以处理大量不同结构的文件类型和子类型,并执行相当复杂的操作,比如类型转换以及使用外部服务实时处理字段值。 这不仅允许从XML中卸载数据,还可以通过外部API处理一些值,包括使用大型语言模型。