[Title]: perl-xml
[Keywords]: perl-xml
[Source]: BLOG.CSDN.NET

Re: how prevent XML::Parser from resolving entity references?
by Alois Heuboeck
Sep 16 2005 8:28AM
> > 1- take an XML file
> > 2- in one script, replace everything above Unicode #x7F (end of ASCII) 
> > with entity references (which can either have "special" names, like 
> > &auml; or be based on the Unicode nb. like &#xAE;)
> > 3- then in another script, do some more transformations using XML::DOM 
> > and
> > 4- print out resulting XML
> >
> > My problem is that in the third step, when parsing its input, the 
> > XML::Parser seems to resolve those references that contain the HEX 
> > Unicode nb.; the "special name" references are not resolved.
>  
>  
>  
>  Strings like &#xAE; are character references rather than entity 
>  references. A character reference is just an alternative way to express 
>  a character code point. Parsers make no difference between a character 
>  encoded with a specific encoding (such as utf-8) and the character 
>  reference. Your step 2 doesn't make much sense to me as XML works well 
>  with Unicode. What is the reason for it?


Petr
& other Perlers,

thanks for your reply.
I'm working on a linguistic corpus project. Some of the tools that need 
to work with the corpus texts are not Unicode-aware.
Basically, that means little more than that they cannot display it.
My thought was that by re-coding as a character reference, although we 
still couldn't display it, at least the information would be retrievable 
by having a look at the underlying XML file (look up the code point in a 
Unicode table).
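
To make the re-coding idea concrete, here is a minimal sketch of what such a step could look like. The helper name to_char_refs is hypothetical (it is not taken from my actual script), and it assumes the text has already been decoded into Perl characters rather than raw bytes:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every character above U+007F with a
# hexadecimal numeric character reference. Assumes $text holds decoded
# characters, not raw UTF-8 bytes.
sub to_char_refs {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
    return $text;
}

# "dur\x{E9}e" is "duree" with U+00E9 (e acute) as its fourth character.
print to_char_refs("dur\x{E9}e"), "\n";   # prints dur&#xE9;e
```

Plain ASCII, including the mark-up itself, passes through unchanged, so the code point stays readable in the file even for tools that cannot display the character.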

Does this make sense to you?


But then, I also encountered another problem when I skipped the phase of 
re-coding:
I still have the script of step 2, which prints out the file after some 
transformations:

-----------------------------------
	#!/usr/bin/perl

	use strict;
	use warnings;
	use encoding 'utf-8';

	my $infile = "file1.xml";
	my $outfile = "file2.xml";

	print "OUT =:\n$outfile\n\n";

	open IN, "$infile" or die "\ncannot read specified infile\n$infile\n";
	my $text = join "", <IN> ;
	close IN;

	# etc. etc.

	# finally print it out

	open OUT, "> $outfile" or die "cannot create out file";

	# Alternatively, I tried this but it
	# seems to make no difference:
	# open OUT, "> :encoding(utf-8)", $outfile or die "cannot create out file";

	print OUT $text;
	close OUT;
-----------------------------------


Here's a snippet of the output I'm getting (here all text, no mark-up):

	...to which Henri Bergson referred as "durée"; the way in which...

... which is OK.
Then, I open the file with the next script, parse it and print it out:

-----------------------------------
	#!/usr/bin/perl
	use strict;
	use XML::DOM;
	use warnings;

	my $infile = "file2.xml";
	my $outfile = "file3.xml";

	my $dom_parser = new XML::DOM::Parser();
	my $TREE = $dom_parser-> parsefile($infile);

	# no adjacent text nodes
	$TREE-> normalize();

	open OUT, "> $outfile" or die "could not open outfile";
	print OUT $TREE-> toString();
	close OUT;
-----------------------------------


The line from above now looks like this:

	...to which Henri Bergson referred as "dur&#14949;; the way in which...

I suspect that the parser is reading the input stream in the wrong 
encoding. But I really can't see how; I thought both files were UTF-8?
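
For reference, this is a minimal sketch of reading and writing a file with explicit UTF-8 layers on both handles, so that neither side depends on a default encoding. It is not a diagnosis of the problem above; the helper names write_utf8 and read_utf8 and the scratch file are assumptions made only for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, removed automatically when the program exits.
my (undef, $path) = tempfile(UNLINK => 1);

# Write decoded characters out as UTF-8 bytes.
sub write_utf8 {
    my ($file, $text) = @_;
    open my $out, '>:encoding(UTF-8)', $file
        or die "cannot create $file: $!";
    print {$out} $text;
    close $out or die "cannot close $file: $!";
}

# Read UTF-8 bytes back in as decoded characters.
sub read_utf8 {
    my ($file) = @_;
    open my $in, '<:encoding(UTF-8)', $file
        or die "cannot read $file: $!";
    local $/;                 # slurp the whole file
    my $text = <$in>;
    close $in;
    return $text;
}

write_utf8($path, "dur\x{E9}e");
my $back = read_utf8($path);
print length($back), " characters, ", -s $path, " bytes\n";
```

With both layers in place the string should come back as five characters while the file holds six bytes, since U+00E9 takes two bytes in UTF-8.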

Best,
Alois

_______________________________________________
Perl-XML mailing list
Perl-XML@[...].com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs