Re: how prevent XML::Parser from resolving entity references?
by Alois Heuboeck
Sep 16 2005 8:28AM

(I'd like to...)
> > 1- take an XML file
> > 2- in one script, replace everything above Unicode #x7F (end of ASCII)
> > with entity references (which can either have "special" names, like
> > &auml; or be based on the Unicode nb. like &#xAE;)
> > 3- then in another script, do some more transformations using XML::DOM
> > and
> > 4- print out resulting XML
> >
> > My problem is that in the third step, when parsing its input, the
> > XML::Parser seems to resolve those references that contain the HEX
> > Unicode nb.; the "special name" references are not resolved.
> >
>
> Strings like &#xAE; are character references rather than entity
> references. A character reference is just an alternative way to express
> a character code point. Parsers make no difference between a character
> encoded with a specific encoding (such as utf-8) and the character
> reference. Your step 2 doesn't make much sense to me as XML works well
> with Unicode. What is the reason for it?

Petr & other Perlers,

thanks for your reply.

I'm working on a linguistic corpus project. Some of the tools that need to
work with the corpus texts are not Unicode-aware. Basically, that means
little more than that they cannot display non-ASCII characters. My thought
was that by re-coding such a character as a character reference, although we
still couldn't display it, at least the information would be retrievable by
having a look at the underlying XML file (look up the code point in a
Unicode table). Does this make sense to you?
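The re-coding idea from step 2 can be sketched roughly as follows. This is only an illustration, not the script used in the project: the helper name to_char_refs is hypothetical, and the sketch assumes the text has already been decoded into a Perl character string (not raw UTF-8 bytes). As noted in the quoted reply, an XML parser that reads the result will turn these references back into the characters they stand for.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): replace every character
# above U+007F with a hexadecimal numeric character reference.
# Assumes $text is a decoded character string, not a raw UTF-8 byte string.
sub to_char_refs {
    my ($text) = @_;
    # Match any single character outside the ASCII range and substitute
    # "&#x" + its code point in upper-case hex + ";".
    $text =~ s/([^\x00-\x7F])/sprintf("&#x%X;", ord($1))/ge;
    return $text;
}

# "\x{E9}" is e-acute, "\x{AE}" is the registered sign.
print to_char_refs("dur\x{E9}e \x{AE}"), "\n";
```

The output of such a step is pure ASCII, so tools that are not Unicode-aware can at least show the code point of each re-coded character.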

But then, I also encountered another problem when I skipped the phase of
re-coding: I still have the script of step 2, which prints out the file
after some transformations:

-----------------------------------
#!/usr/bin/perl

use strict;
use warnings;
use encoding 'utf-8';

my $infile = "file1.xml";
my $outfile = "file2.xml";

print "OUT =:\n$outfile\n\n";

open IN, "$infile" or die "\ncannot read specified infile\n$infile\n";
my $text = join "", <IN> ;
close IN;

# etc. etc.

# finally print it out
open OUT, "> $outfile" or die "cannot create out file";
# Alternatively, I tried this but it
# seems to make no difference:
# open OUT, "> :encoding(utf-8)", $outfile or die "cannot create out file";
print OUT $text;
close OUT;
-----------------------------------

Here's a snippet of the output I'm getting (here all text, no mark-up):

...to which Henri Bergson referred as "durée"; the way in which...

... which is OK. Then, I open the file with the next script, parse it and
print it out:

-----------------------------------
#!/usr/bin/perl

use strict;
use XML::DOM;
use warnings;

my $infile = "file2.xml";
my $outfile = "file3.xml";

my $dom_parser = new XML::DOM::Parser();
my $TREE = $dom_parser->parsefile($infile);

# no adjacent text nodes
$TREE->normalize();

open OUT, "> $outfile" or die "could not open outfile";
print OUT $TREE->toString();
close OUT;
-----------------------------------

The line from above now looks like this:

...to which Henri Bergson referred as "dur㩥; the way in which...

I suspect that the parser interprets the IN stream as some wrong encoding.
But I really can't see how, I thought that both were UTF-8??

Best,
Alois

_______________________________________________
Perl-XML mailing list
Perl-XML@[...].com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs