lxml XPath does not ignore "&nbsp;"

I have this HTML:

<td class="0"> <b>Bold Text</b>&nbsp; <a href=""></a> </td> <td class="0"> Regular Text&nbsp; <a href=""></a> </td> 

When I query it with XPath like this:

 new_html = tree.xpath('//td[@class="0"]/text() | //td[@class="0"]/b/text()') 

it prints:

 ['Bold Text', '', 'Regular Text'] 

As you can see, the &nbsp; character was not ignored and is actually read as an additional entry in the td. How can I get cleaner output?

2 solutions for "lxml XPath does not ignore &nbsp;"

Instead, I would iterate over all the desired td elements and call .text_content() on each, then strip the result:

 [td.text_content().strip() for td in tree.xpath('//td[@class="0"]')] 

Prints:

 [u'Bold Text', u'Regular Text'] 
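
For readers without lxml at hand, here is a minimal sketch of the same "join all text below the cell, then strip" idea using only the standard library. It is not the answer's code: it swaps in xml.etree.ElementTree and itertext() for lxml's text_content(), and the SAMPLE markup and the cell_texts helper are made up for illustration. The sample wraps the question's cells in a table so the XML parser accepts them, and writes &#160; instead of &nbsp; because a plain XML parser does not know HTML entities.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the question's markup.
# &#160; is the numeric form of &nbsp; (U+00A0).
SAMPLE = (
    '<table><tr>'
    '<td class="0"> <b>Bold Text</b>&#160; <a href=""></a> </td>'
    '<td class="0"> Regular Text&#160; <a href=""></a> </td>'
    '</tr></table>'
)


def cell_texts(markup):
    """Return the stripped text of every <td class="0"> in the markup."""
    root = ET.fromstring(markup)
    return [
        # itertext() yields every text node below the cell in document order;
        # str.strip() in Python 3 also removes U+00A0 at the ends.
        "".join(td.itertext()).strip()
        for td in root.iter("td")
        if td.get("class") == "0"
    ]


print(cell_texts(SAMPLE))
```

Working per cell rather than per text node is what avoids the stray whitespace-only entry from the question: the &nbsp; ends up at the edge of one joined string, where strip() can remove it.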

Note: I'm writing this not so much as an answer, but as an interesting thing (that I didn't know) about XPath's normalize-space(). It might help other users.

It looks like normalize-space(), which I would have suggested here, does not strip 'NO-BREAK SPACE' (U+00A0):

    >>> import lxml.html
    >>> text = '''<html>
    ... <table>
    ... <tr>
    ... <td class="0">
    ... <b>Bold Text</b>&nbsp;
    ... <a href=""></a>
    ... </td>
    ...
    ... <td class="0">
    ... Regular Text&nbsp;
    ... <a href=""></a>
    ... </td>
    ... </tr>
    ... </table>
    ... </html>'''
    >>> doc = lxml.html.fromstring(text)
    >>>
    >>> # ouch, &nbsp; is not stripped...
    >>> [td.xpath('normalize-space(.)') for td in doc.xpath('.//td[@class="0"]')]
    [u'Bold Text\xa0', u'Regular Text\xa0']
    >>>
    >>> # one needs to strip() like in @alecxe's answer
    >>> [td.xpath('normalize-space(.)').strip() for td in doc.xpath('.//td[@class="0"]')]
    [u'Bold Text', u'Regular Text']
    >>>
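
The reason is that XPath 1.0 defines whitespace the way XML does: only space, tab, carriage return and line feed count. The sketch below is a rough pure-Python stand-in for normalize-space() under that definition, just to make the behaviour visible without lxml. The xpath_normalize_space helper and the sample string are made up for illustration and are not part of any library.

```python
import re

# XPath 1.0 borrows XML's whitespace definition: space, tab, CR and LF only.
XML_WHITESPACE = " \t\r\n"


def xpath_normalize_space(value):
    """Rough pure-Python stand-in for XPath 1.0 normalize-space()."""
    # Trim XML whitespace at both ends...
    trimmed = value.strip(XML_WHITESPACE)
    # ...then collapse each internal run of XML whitespace to one space.
    return re.sub("[ \t\r\n]+", " ", trimmed)


cell = " Bold Text\xa0  "
print(repr(xpath_normalize_space(cell)))          # 'Bold Text\xa0'
print(repr(xpath_normalize_space(cell).strip()))  # 'Bold Text'
```

U+00A0 is not in that four-character set, so it survives; Python's own strip() uses the broader Unicode notion of whitespace and removes it.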

Edit:

So I kept digging into whitespace characters and whether or not they get stripped by Python's strip() or XPath's normalize-space().

The following is a bit longer than I first intended, but it's the whole script (a Python 2 interactive session) to test Unicode whitespace code points:

    >>> import lxml.html
    >>> import requests
    >>>
    >>> whitespace_chars_wikipedia = 'https://en.wikipedia.org/wiki/Whitespace_character#Unicode'
    >>> r = requests.get(whitespace_chars_wikipedia)
    >>>
    >>> doc = lxml.html.fromstring(r.text)
    >>>
    >>>
    >>> import collections
    >>> import re
    >>>
    >>> WhitespaceChar = collections.namedtuple('WhitespaceChar', ['codepoint', 'name', 'decimal', 'named_entity'])
    >>> r = re.compile('')
    >>> wchars = {}
    >>> for table in doc.xpath('''
    ...     .//div[@class="NavHead"][.//strong="Whitespace"]
    ...         /following-sibling::div[@class="NavContent"]
    ...             //table[1]
    ...     |
    ...     .//table[caption="Related characters"]
    ...     '''):
    ...     for row in table.xpath('.//tr[position()>1]'):
    ...         codepoint = row.xpath('string(./td[1]/text()[last()])')
    ...         name = row.xpath('normalize-space(./td[2])').upper()
    ...         decimal = int(row.xpath('string(./td[3])'))
    ...         named_entity = row.xpath('''string(
    ...             ./td[last()]/text()[contains(., "HTML/XML named entity: ")]
    ...                 /following-sibling::code
    ...             )''')
    ...         wchars[decimal] = WhitespaceChar(codepoint, name, decimal, named_entity or None)
    ...
    >>>
    >>> listitems = "\n".join(
    ...     '<li><i>&#x{wchar.decimal:04X};</i> <b data-decimal="{wchar.decimal}">{wchar.codepoint}</b> <i>&#x{wchar.decimal:04X};</i></li>'.format(wchar=c)
    ...     for c in sorted(wchars.values(), key=lambda c: c.decimal)
    ...     )
    >>> text = '''
    ... <html>
    ... <body>
    ... <ul>
    ... {}
    ... </ul>
    ... </body>
    ... </html>
    ... '''.format(listitems)
    >>> print text

    <html>
    <body>
    <ul>
    <li><i>&#x0009;</i> <b data-decimal="9">U+0009</b> <i>&#x0009;</i></li>
    <li><i>&#x000A;</i> <b data-decimal="10">U+000A</b> <i>&#x000A;</i></li>
    <li><i>&#x000B;</i> <b data-decimal="11">U+000B</b> <i>&#x000B;</i></li>
    <li><i>&#x000C;</i> <b data-decimal="12">U+000C</b> <i>&#x000C;</i></li>
    <li><i>&#x000D;</i> <b data-decimal="13">U+000D</b> <i>&#x000D;</i></li>
    <li><i>&#x0020;</i> <b data-decimal="32">U+0020</b> <i>&#x0020;</i></li>
    <li><i>&#x0085;</i> <b data-decimal="133">U+0085</b> <i>&#x0085;</i></li>
    <li><i>&#x00A0;</i> <b data-decimal="160">U+00A0</b> <i>&#x00A0;</i></li>
    <li><i>&#x1680;</i> <b data-decimal="5760">U+1680</b> <i>&#x1680;</i></li>
    <li><i>&#x180E;</i> <b data-decimal="6158">U+180E</b> <i>&#x180E;</i></li>
    <li><i>&#x2000;</i> <b data-decimal="8192">U+2000</b> <i>&#x2000;</i></li>
    <li><i>&#x2001;</i> <b data-decimal="8193">U+2001</b> <i>&#x2001;</i></li>
    <li><i>&#x2002;</i> <b data-decimal="8194">U+2002</b> <i>&#x2002;</i></li>
    <li><i>&#x2003;</i> <b data-decimal="8195">U+2003</b> <i>&#x2003;</i></li>
    <li><i>&#x2004;</i> <b data-decimal="8196">U+2004</b> <i>&#x2004;</i></li>
    <li><i>&#x2005;</i> <b data-decimal="8197">U+2005</b> <i>&#x2005;</i></li>
    <li><i>&#x2006;</i> <b data-decimal="8198">U+2006</b> <i>&#x2006;</i></li>
    <li><i>&#x2007;</i> <b data-decimal="8199">U+2007</b> <i>&#x2007;</i></li>
    <li><i>&#x2008;</i> <b data-decimal="8200">U+2008</b> <i>&#x2008;</i></li>
    <li><i>&#x2009;</i> <b data-decimal="8201">U+2009</b> <i>&#x2009;</i></li>
    <li><i>&#x200A;</i> <b data-decimal="8202">U+200A</b> <i>&#x200A;</i></li>
    <li><i>&#x200B;</i> <b data-decimal="8203">U+200B</b> <i>&#x200B;</i></li>
    <li><i>&#x200C;</i> <b data-decimal="8204">U+200C</b> <i>&#x200C;</i></li>
    <li><i>&#x200D;</i> <b data-decimal="8205">U+200D</b> <i>&#x200D;</i></li>
    <li><i>&#x2028;</i> <b data-decimal="8232">U+2028</b> <i>&#x2028;</i></li>
    <li><i>&#x2029;</i> <b data-decimal="8233">U+2029</b> <i>&#x2029;</i></li>
    <li><i>&#x202F;</i> <b data-decimal="8239">U+202F</b> <i>&#x202F;</i></li>
    <li><i>&#x205F;</i> <b data-decimal="8287">U+205F</b> <i>&#x205F;</i></li>
    <li><i>&#x2060;</i> <b data-decimal="8288">U+2060</b> <i>&#x2060;</i></li>
    <li><i>&#x3000;</i> <b data-decimal="12288">U+3000</b> <i>&#x3000;</i></li>
    <li><i>&#xFEFF;</i> <b data-decimal="65279">U+FEFF</b> <i>&#xFEFF;</i></li>
    </ul>
    </body>
    </html>

    >>>
    >>>
    >>> doc2 = lxml.html.fromstring(text)
    >>>
    >>> from prettytable import PrettyTable
    >>>
    >>> x = PrettyTable([
    ...     #"#",
    ...     #"Code point",
    ...     "Name",
    ...     #"Char Python repr",
    ...     "Test string",
    ...     "strip()",
    ...     "normalize-space()"
    ...     ])
    >>>
    >>> for cnt, li in enumerate(doc2.xpath('.//ul/li'), start=1):
    ...     codepoint = li.xpath('string(b)')
    ...     wc = wchars[li.xpath('number(b/@data-decimal)')]
    ...     tstring = li.xpath('string(.)')
    ...     x.add_row([
    ...         #cnt,
    ...         #wc.codepoint,
    ...         wc.name,
    ...         #repr([unichr(wc.decimal)]).strip('[]'),
    ...         repr([tstring]).strip('[]'),
    ...         tstring.strip() == codepoint,
    ...         li.xpath('normalize-space(.)') == codepoint
    ...     ])
    ...

Do strip() and normalize-space() strip these whitespace characters?

    >>> print x
    +-------------------------------+-------------------------+---------+-------------------+
    |              Name             |       Test string       | strip() | normalize-space() |
    +-------------------------------+-------------------------+---------+-------------------+
    |      CHARACTER TABULATION     |      '\t U+0009 \t'     |   True  |        True       |
    |           LINE FEED           |      '\n U+000A \n'     |   True  |        True       |
    |        LINE TABULATION        |        ' U+000B '       |   True  |        True       |
    |           FORM FEED           |        ' U+000C '       |   True  |        True       |
    |        CARRIAGE RETURN        |      '\r U+000D \r'     |   True  |        True       |
    |             SPACE             |        ' U+0020 '       |   True  |        True       |
    |           NEXT LINE           |   u'\x85 U+0085 \x85'   |   True  |       False       |
    |         NO-BREAK SPACE        |   u'\xa0 U+00A0 \xa0'   |   True  |       False       |
    |        OGHAM SPACE MARK       | u'\u1680 U+1680 \u1680' |   True  |       False       |
    |   MONGOLIAN VOWEL SEPARATOR   | u'\u180e U+180E \u180e' |   True  |       False       |
    |            EN QUAD            | u'\u2000 U+2000 \u2000' |   True  |       False       |
    |            EM QUAD            | u'\u2001 U+2001 \u2001' |   True  |       False       |
    |            EN SPACE           | u'\u2002 U+2002 \u2002' |   True  |       False       |
    |            EM SPACE           | u'\u2003 U+2003 \u2003' |   True  |       False       |
    |       THREE-PER-EM SPACE      | u'\u2004 U+2004 \u2004' |   True  |       False       |
    |       FOUR-PER-EM SPACE       | u'\u2005 U+2005 \u2005' |   True  |       False       |
    |        SIX-PER-EM SPACE       | u'\u2006 U+2006 \u2006' |   True  |       False       |
    |          FIGURE SPACE         | u'\u2007 U+2007 \u2007' |   True  |       False       |
    |       PUNCTUATION SPACE       | u'\u2008 U+2008 \u2008' |   True  |       False       |
    |           THIN SPACE          | u'\u2009 U+2009 \u2009' |   True  |       False       |
    |           HAIR SPACE          | u'\u200a U+200A \u200a' |   True  |       False       |
    |        ZERO WIDTH SPACE       | u'\u200b U+200B \u200b' |  False  |       False       |
    |     ZERO WIDTH NON-JOINER     | u'\u200c U+200C \u200c' |  False  |       False       |
    |       ZERO WIDTH JOINER       | u'\u200d U+200D \u200d' |  False  |       False       |
    |         LINE SEPARATOR        | u'\u2028 U+2028 \u2028' |   True  |       False       |
    |      PARAGRAPH SEPARATOR      | u'\u2029 U+2029 \u2029' |   True  |       False       |
    |     NARROW NO-BREAK SPACE     | u'\u202f U+202F \u202f' |   True  |       False       |
    |   MEDIUM MATHEMATICAL SPACE   | u'\u205f U+205F \u205f' |   True  |       False       |
    |          WORD JOINER          | u'\u2060 U+2060 \u2060' |  False  |       False       |
    |       IDEOGRAPHIC SPACE       | u'\u3000 U+3000 \u3000' |   True  |       False       |
    | ZERO WIDTH NON-BREAKING SPACE | u'\ufeff U+FEFF \ufeff' |  False  |       False       |
    +-------------------------------+-------------------------+---------+-------------------+
    >>>
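
If you only want to spot-check a few rows of that table without lxml, requests or prettytable, something like the following sketch might do. It is an approximation under stated assumptions: it runs on Python 3 (whose Unicode data is newer than the Python 2 session above, so individual code points can classify differently), it models normalize-space() trimming as "strip only the four XML whitespace characters", and CODE_POINTS and classify are hypothetical names chosen for illustration.

```python
import unicodedata

# XML/XPath 1.0 whitespace: space, tab, CR, LF.
XML_WHITESPACE = " \t\r\n"

# A hand-picked subset of the code points from the table.
CODE_POINTS = [0x0009, 0x0020, 0x0085, 0x00A0, 0x200B, 0x3000, 0xFEFF]


def classify(cp):
    """Return (name, stripped by str.strip(), stripped under XML rules)."""
    ch = chr(cp)
    # Control characters have no Unicode name, so fall back to a placeholder.
    name = unicodedata.name(ch, "<control>")
    sample = ch + "x" + ch
    stripped_by_python = sample.strip() == "x"
    stripped_by_xml_rules = sample.strip(XML_WHITESPACE) == "x"
    return name, stripped_by_python, stripped_by_xml_rules


for cp in CODE_POINTS:
    name, by_python, by_xml = classify(cp)
    print("U+{:04X}  {:<28} strip()={!s:<5}  XML whitespace={}".format(
        cp, name, by_python, by_xml))
```

For the code points listed here the two flags should line up with the strip() and normalize-space() columns of the table.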

Whitespace characters:

    >>> import pprint
    >>> pprint.pprint(wchars)
    {9: WhitespaceChar(codepoint='U+0009', name='CHARACTER TABULATION', decimal=9, named_entity=None),
     10: WhitespaceChar(codepoint='U+000A', name='LINE FEED', decimal=10, named_entity='&NewLine;'),
     11: WhitespaceChar(codepoint='U+000B', name='LINE TABULATION', decimal=11, named_entity=None),
     12: WhitespaceChar(codepoint='U+000C', name='FORM FEED', decimal=12, named_entity=None),
     13: WhitespaceChar(codepoint='U+000D', name='CARRIAGE RETURN', decimal=13, named_entity=None),
     32: WhitespaceChar(codepoint='U+0020', name='SPACE', decimal=32, named_entity=None),
     133: WhitespaceChar(codepoint='U+0085', name='NEXT LINE', decimal=133, named_entity=None),
     160: WhitespaceChar(codepoint='U+00A0', name='NO-BREAK SPACE', decimal=160, named_entity='&nbsp;'),
     5760: WhitespaceChar(codepoint='U+1680', name='OGHAM SPACE MARK', decimal=5760, named_entity=None),
     6158: WhitespaceChar(codepoint='U+180E', name='MONGOLIAN VOWEL SEPARATOR', decimal=6158, named_entity=None),
     8192: WhitespaceChar(codepoint='U+2000', name='EN QUAD', decimal=8192, named_entity=None),
     8193: WhitespaceChar(codepoint='U+2001', name='EM QUAD', decimal=8193, named_entity=None),
     8194: WhitespaceChar(codepoint='U+2002', name='EN SPACE', decimal=8194, named_entity='&ensp;'),
     8195: WhitespaceChar(codepoint='U+2003', name='EM SPACE', decimal=8195, named_entity='&emsp;'),
     8196: WhitespaceChar(codepoint='U+2004', name='THREE-PER-EM SPACE', decimal=8196, named_entity='&emsp13;'),
     8197: WhitespaceChar(codepoint='U+2005', name='FOUR-PER-EM SPACE', decimal=8197, named_entity='&emsp14;'),
     8198: WhitespaceChar(codepoint='U+2006', name='SIX-PER-EM SPACE', decimal=8198, named_entity=None),
     8199: WhitespaceChar(codepoint='U+2007', name='FIGURE SPACE', decimal=8199, named_entity='&numsp;'),
     8200: WhitespaceChar(codepoint='U+2008', name='PUNCTUATION SPACE', decimal=8200, named_entity='&puncsp;'),
     8201: WhitespaceChar(codepoint='U+2009', name='THIN SPACE', decimal=8201, named_entity='&thinsp;'),
     8202: WhitespaceChar(codepoint='U+200A', name='HAIR SPACE', decimal=8202, named_entity='&hairsp;'),
     8203: WhitespaceChar(codepoint='U+200B', name='ZERO WIDTH SPACE', decimal=8203, named_entity=None),
     8204: WhitespaceChar(codepoint='U+200C', name='ZERO WIDTH NON-JOINER', decimal=8204, named_entity='&zwnj;'),
     8205: WhitespaceChar(codepoint='U+200D', name='ZERO WIDTH JOINER', decimal=8205, named_entity='&zwj;'),
     8232: WhitespaceChar(codepoint='U+2028', name='LINE SEPARATOR', decimal=8232, named_entity=None),
     8233: WhitespaceChar(codepoint='U+2029', name='PARAGRAPH SEPARATOR', decimal=8233, named_entity=None),
     8239: WhitespaceChar(codepoint='U+202F', name='NARROW NO-BREAK SPACE', decimal=8239, named_entity=None),
     8287: WhitespaceChar(codepoint='U+205F', name='MEDIUM MATHEMATICAL SPACE', decimal=8287, named_entity='&MediumSpace;'),
     8288: WhitespaceChar(codepoint='U+2060', name='WORD JOINER', decimal=8288, named_entity='&NoBreak;'),
     12288: WhitespaceChar(codepoint='U+3000', name='IDEOGRAPHIC SPACE', decimal=12288, named_entity=None),
     65279: WhitespaceChar(codepoint='U+FEFF', name='ZERO WIDTH NON-BREAKING SPACE', decimal=65279, named_entity=None)}
    >>>