【Problem】
In C#, I was using HtmlAgilityPack to parse:
http://www.amazon.com/Kindle-Fire-HD/dp/B0083PWAPW/ref=lp_1055398_1_2?ie=UTF8&qid=1369721900&sr=1-2
and, within its HTML, this piece of text:
| World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini) |
The corresponding page source turned out to be:
<span>World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (<a href="#" id="kpp-popover-0">compared to the iPad mini</a><script type="text/javascript"> + ‘<img src="http://g-ec2.images-amazon.com/images/G/01/kindle/dp/2012/KT/tate_feature-wifi._V395653267_.gif"/>’ }); </script>)</span> |
After parsing it with HtmlAgilityPack, however, the node's InnerText was:
| World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini\namznJQ.available(‘jQuery’, function() { \n(function ($) {\namznJQ.available(‘popover’, function() {\n\tvar content = ‘<h2 style=\"font-size: 17px;\">Two Antennas, Better Bandwidth</h2>’ \n\n\t+ ‘<img src=\"http://g-ec2.images-amazon.com/images/G/01/kindle/dp/2012/KT/tate_feature-wifi._V395653267_.gif\"/>’\n\t\n\t$(‘#kpp-popover-0’).amazonPopoverTrigger({\n\t\tliteralContent: content,\n\t\tcloseText: ‘Close’,\n\t\ttitle: ‘ ’,\n\t\twidth: 550,\n\t\tlocation: ‘centered’\n\t});\n\n});\n}(jQuery)); \n}); \n\n) |
rather than the expected:
| World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini) |
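In other words, the JavaScript has to be removed from InnerText.
【Solution process】
1. I went back to a post I had read earlier:
向HtmlAgilityPack道歉:解析HTML还是你好用 (An apology to HtmlAgilityPack: you are still the best at parsing HTML)
and the related:
C#: HtmlAgilityPack extract inner text
After debugging for quite a while, I finally used: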
// Remove the child nodes matched by an XPath expression from the current HTML node.
// e.g. pass "script" to remove <script type="text/javascript"> children.
// Note: the node is modified in place; the returned node is the same instance.
public HtmlNode removeSubHtmlNode(HtmlNode curHtmlNode, string subNodeToRemove)
{
    // Same instance as curHtmlNode, not a copy.
    HtmlNode afterRemoved = curHtmlNode;

    // A bare name such as "script" is a relative XPath that matches direct
    // children only, which is what RemoveChild expects.
    // SelectNodes returns null when nothing matches.
    HtmlNodeCollection foundAllSub = curHtmlNode.SelectNodes(subNodeToRemove);
    if ((foundAllSub != null) && (foundAllSub.Count > 0))
    {
        foreach (HtmlNode subNode in foundAllSub)
        {
            curHtmlNode.RemoveChild(subNode);
        }
    }

    // Failed attempt, kept for reference:
    //foreach (var subNode in afterRemoved.Descendants(subNodeToRemove))
    //{
    //    //An unhandled exception of type 'System.InvalidOperationException' occurred in mscorlib.dll
    //    //Additional information: Collection was modified; enumeration operation may not execute.
    //    afterRemoved.RemoveChild(subNode);
    //    curHtmlNode.RemoveChild(subNode);
    //    //subNode.Remove();
    //}

    return afterRemoved;
}
// Usage: crl is the object that defines removeSubHtmlNode.
HtmlNode curBulletNode = allBulletNodeList[idx];
HtmlNode noJsNode = crl.removeSubHtmlNode(curBulletNode, "script");
HtmlNode noStyleNode = crl.removeSubHtmlNode(curBulletNode, "style");
string bulletStr = noStyleNode.InnerText;
and that solved the problem.
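A few things are worth noting here:
1. In the example given in that answer, finding the nodes with
htmlDoc.DocumentNode.Descendants("script")
and then deleting each one with
script.Remove();
works.
2. But if the current HTML node is handled in a similar way here: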
foreach (var subNode in afterRemoved.Descendants(subNodeToRemove))
{
//An unhandled exception of type 'System.InvalidOperationException' occurred in mscorlib.dll
//Additional information: Collection was modified; enumeration operation may not execute.
afterRemoved.RemoveChild(subNode);
curHtmlNode.RemoveChild(subNode);
//subNode.Remove();
}
then the error quoted in the comments is raised:
Additional information: Collection was modified; enumeration operation may not execute.
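That is, removing items from a collection while it is being enumerated is not allowed. Descendants() walks the node's live child list, so removing a node inside the loop invalidates the enumerator.
That is why I had to find another way to get the same removal effect: SelectNodes returns a separate collection, so nodes can be removed from the tree while iterating over that collection.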
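The rule is not specific to HtmlAgilityPack. Below is a minimal sketch with a plain List<string>; the class and method names are hypothetical and exist only for illustration. It shows the same InvalidOperationException, and shows that enumerating a snapshot taken with ToList() avoids it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnumerationDemo
{
    // Removes items from the list that is being enumerated.
    // Returns true if the runtime rejects this with InvalidOperationException.
    public static bool RemoveWhileEnumeratingThrows()
    {
        var items = new List<string> { "a", "script", "b", "script" };
        try
        {
            foreach (string item in items)
            {
                if (item == "script")
                {
                    items.Remove(item); // modifies the list mid-enumeration
                }
            }
            return false;
        }
        catch (InvalidOperationException)
        {
            // "Collection was modified; enumeration operation may not execute."
            return true;
        }
    }

    // Enumerates a copy (snapshot) and removes from the original list.
    public static List<string> RemoveViaSnapshot()
    {
        var items = new List<string> { "a", "script", "b", "script" };
        foreach (string item in items.ToList())
        {
            if (item == "script")
            {
                items.Remove(item); // safe: the loop iterates over the copy
            }
        }
        return items;
    }

    public static void Main()
    {
        Console.WriteLine(RemoveWhileEnumeratingThrows());
        Console.WriteLine(string.Join(",", RemoveViaSnapshot()));
    }
}
```

The snapshot costs one extra list allocation, which is negligible for the handful of nodes involved in a case like this one.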
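【Summary】
Getting this kind of removal to work was truly exhausting...
Removing child nodes under the root node: easy.
Removing nodes under some current node: hard. (Later, while debugging, I found that the call to subNode.Remove(); had in fact already removed the node successfully; the error came afterwards, when the foreach loop tried to continue enumerating. So the real obstacle is removing nodes while the live collection is still being enumerated, not which node you start from.)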
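Since Remove() itself succeeds, one variant is to collect the matching nodes into a list first and only then remove them. The following is a minimal sketch of that idea, not the code used in this post. It assumes the HtmlAgilityPack NuGet package is referenced, uses a shortened HTML sample, and the names InnerTextDemo, RemoveDescendants and TextWithoutScripts are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package

public static class InnerTextDemo
{
    // Hypothetical helper: removes every descendant element whose tag name
    // is in tagNames, at any depth below the given node.
    public static void RemoveDescendants(HtmlNode node, params string[] tagNames)
    {
        // ToList() materializes the lazy Descendants() sequence first, so
        // the removals below do not disturb an enumeration in progress.
        var toRemove = node.Descendants()
                           .Where(n => tagNames.Contains(n.Name))
                           .ToList();
        foreach (HtmlNode n in toRemove)
        {
            n.Remove(); // detaches the node and its subtree from its parent
        }
    }

    // Parses the HTML, strips script/style elements from the first <span>,
    // and returns that span's InnerText.
    public static string TextWithoutScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");
        RemoveDescendants(span, "script", "style");
        return span.InnerText;
    }

    public static void Main()
    {
        // Shortened stand-in for the Amazon snippet discussed above.
        string html = "<span>faster streaming (<a href=\"#\">compared to the iPad mini</a>"
                    + "<script type=\"text/javascript\">var x = 1;</script>)</span>";
        Console.WriteLine(TextWithoutScripts(html));
    }
}
```

Unlike RemoveChild, which only accepts direct children, this variant also handles script or style elements nested deeper inside the node.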
Please credit the source when reposting: 在路上 » 【Solved】When parsing HTML with HtmlAgilityPack in C#, InnerText contains JavaScript that needs to be removed