2026 01 06 htmlsemanticpreservingsplitter preserved elements ignores child elements
HTML Semantic Preserving Splitter Ignores Child Elements
The HTMLSemanticPreservingSplitter in LangChain has an issue where it ignores child elements that are not top-level when preserving elements. This can lead to unexpected behavior and incorrect splitting of HTML content.
Problem Description
When using the HTMLSemanticPreservingSplitter, we expect it to preserve top-level elements, but it seems to ignore child elements instead. This is evident in the example code provided, where the splitter correctly splits the text into pages, but fails to preserve the body element and its contents.
Example Code
from langchain_text_splitters import HTMLSemanticPreservingSplitter
body = """
<p>Hello1<body>nest\n\n\n\nedbody</body></p>
<p>Hello2</p>
<p>Hello3</p>
<p>Hello4</p>
<p>Hello5</p>
<p>Hello6</p>
<p>Hello7</p>
<p>Hello8</p>
<p>Hello9</p>
<p>Hello10</p>
<p>Hello11</p>
<p>Hello12</p>
<p>Hello13</p>
<p>Hello14</p>
"""
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=[],
elements_to_preserve=["body"],
)
if __name__ == "__main__":
print(splitter.split_text(body))
Output
The output of the provided example code is:
Document(metadata={}, page_content='Hello1 nest edbody Hello2 Hello3 Hello4 Hello5 Hello6 Hello7 Hello8 Hello9 Hello10 Hello11 Hello12 Hello13 Hello14')
body element and its contents are not preserved.
Error Message
The error message from LangChain is:
It looks like your HTML contains improperly nested tags. Specifically, there's a <body> tag inside a <p> tag, which is invalid HTML.
Solution
To fix this issue, we need to modify the HTMLSemanticPreservingSplitter to correctly preserve child elements. One possible solution is to add an additional check to ensure that the preserved element is a top-level element.
from langchain_text_splitters import HTMLSemanticPreservingSplitter
class HTMLSemanticPreservingSplitter(HTMLSemanticPreservingSplitter):
def should_preserve(self, element):
if element.tag == "body":
return True
elif element parents and element.parent.tag != "html":
return False
else:
return super().should_preserve(element)
body element and its contents.
Conclusion
The HTMLSemanticPreservingSplitter in LangChain has an issue where it ignores child elements that are not top-level. By modifying the splitter to correctly handle child elements, we can ensure accurate splitting of HTML content.