Thursday 1 December 2011

C#: Removing or deleting an element and its contents from XML using regex

I was performing a code review the other day when I came upon this code:

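doc = rover.data.ArtisteDataHelper.getCurrentEventShows(festivalId);

// Remove every <biography> element. GetElementsByTagName returns a live list,
// so it shrinks as nodes are removed and the loop ends once it is empty.
XmlNodeList list = doc.DocumentElement.GetElementsByTagName("biography");
while (list.Count > 0) {
    list[0].ParentNode.RemoveChild(list[0]);
}

// Repeat for every <description> element.
list = doc.DocumentElement.GetElementsByTagName("description");
while (list.Count > 0) {
    list[0].ParentNode.RemoveChild(list[0]);
}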

This operation was being performed over a 7 MB file and took around 18 seconds to complete. So I thought I would see how fast these nodes could be removed using regular expressions.

I came up with the following code. It was designed specifically to remove elements that have no attributes and contain a single CDATA section. You can of course write a regex to match whatever you need, and the performance gains should be similar.

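// Serialize the document once (InnerXml is already a string), then strip
// each element type with one regex pass. The inner group matches CDATA
// content up to, but not including, the closing "]]>".
string xmlstring = doc.InnerXml;
string xmlstringresult = Regex.Replace(xmlstring, @"<description><!\[CDATA\[((?:[^\]]|\](?!\]>))*)\]\]></description>", "");
string xmlstringresult2 = Regex.Replace(xmlstringresult, @"<biography><!\[CDATA\[((?:[^\]]|\](?!\]>))*)\]\]></biography>", "");
// Parse the stripped markup back into a document.
result = new XmlDocument();
result.LoadXml(xmlstringresult2);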
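To see the pattern at work on something small, here is a minimal sketch that runs the same regex over an inline sample. The class name CdataElementStripper, the Strip helper and the sample XML are all made up for illustration and are not part of the code under review.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class CdataElementStripper
{
    // Hypothetical helper: removes every <name><![CDATA[...]]></name> element.
    // It only matches elements with no attributes whose sole content is a
    // single CDATA section, the same restriction as the regex in the post.
    public static string Strip(string xml, string elementName)
    {
        string name = Regex.Escape(elementName);
        string pattern = "<" + name + @"><!\[CDATA\[(?:[^\]]|\](?!\]>))*\]\]></" + name + ">";
        return Regex.Replace(xml, pattern, "");
    }

    public static void Main()
    {
        // Sample data invented for this example; note the stray ']' and the
        // markup inside the CDATA section, which the pattern has to tolerate.
        string xml =
            "<shows><show><title>Act One</title>" +
            "<biography><![CDATA[Born in <b>1970</b>, a ] b]]></biography>" +
            "<description><![CDATA[A long description]]></description>" +
            "</show></shows>";

        string stripped = Strip(Strip(xml, "description"), "biography");

        // Loading the result confirms that the remaining markup is well-formed.
        XmlDocument result = new XmlDocument();
        result.LoadXml(stripped);
        Console.WriteLine(result.OuterXml);
    }
}
```

Only the title element should survive inside each show. If the real elements carry attributes or mix text with CDATA, this pattern will not match them and would need adjusting.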

Result:
WHILE LOOP: 18 seconds
REGEX: 0.41 seconds
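
If you want to try a comparison like this yourself, the following is a rough sketch of a timing harness using Stopwatch. It is not the measurement behind the numbers above: the RemovalTiming class, the synthetic document layout and the show count are all assumptions, and the timings will depend heavily on document size and shape.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

public static class RemovalTiming
{
    static readonly string[] Names = { "biography", "description" };

    // Builds a synthetic document; this layout is invented, not the real feed.
    public static XmlDocument BuildSample(int shows)
    {
        var sb = new StringBuilder("<shows>");
        for (int i = 0; i < shows; i++)
        {
            sb.Append("<show><title>Show ").Append(i).Append("</title>")
              .Append("<biography><![CDATA[Biography ").Append(i).Append("]]></biography>")
              .Append("<description><![CDATA[Description ").Append(i).Append("]]></description></show>");
        }
        sb.Append("</shows>");
        var doc = new XmlDocument();
        doc.LoadXml(sb.ToString());
        return doc;
    }

    // The DOM approach from the code review: remove nodes one at a time.
    public static XmlDocument RemoveWithLoop(XmlDocument doc)
    {
        foreach (string name in Names)
        {
            XmlNodeList list = doc.DocumentElement.GetElementsByTagName(name);
            while (list.Count > 0)
                list[0].ParentNode.RemoveChild(list[0]);
        }
        return doc;
    }

    // The regex approach: serialize, strip, and parse again.
    public static XmlDocument RemoveWithRegex(XmlDocument doc)
    {
        string xml = doc.InnerXml;
        foreach (string name in Names)
            xml = Regex.Replace(xml, "<" + name + @"><!\[CDATA\[(?:[^\]]|\](?!\]>))*\]\]></" + name + ">", "");
        var result = new XmlDocument();
        result.LoadXml(xml);
        return result;
    }

    public static void Main()
    {
        // Build both inputs first so that only the removal itself is timed.
        XmlDocument forLoop = BuildSample(2000);
        XmlDocument forRegex = BuildSample(2000);

        var sw = Stopwatch.StartNew();
        RemoveWithLoop(forLoop);
        Console.WriteLine("While loop: " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        RemoveWithRegex(forRegex);
        Console.WriteLine("Regex: " + sw.ElapsedMilliseconds + " ms");
    }
}
```

Note that the regex timing here includes re-parsing the string into a new XmlDocument, which matches what the snippet in this post does.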