Java Forum: Making Lex read UTF-8 doesn't work

Friday, April 20, 2012

Making Lex read UTF-8 doesn't work

I wrote an XML parser that parses ASCII files, but I now need it to read UTF-8 encoded files. I have the following regular expressions in Lex, but they don't match UTF-8 input, and I am not sure what I am doing wrong:

utf_8       [\x00-\xff]*
bom         [\xEF\xBB\xBF]

then:

bom             { fprintf( stderr, "OMG I SAW A BOM"); return BOM;}
utf_8           { fprintf( stderr, "OMG I SAW A UTF CHAR", yytext[0] ); return UTF_8; }

I also have the following grammar rules:

program 
 : UTF8 '<' '?'ID attribute_list '?''>' 
 root ...

where UTF8 is:

UTF8
 : BOM           {printf("i saw a bom\n");}
 | UTF_8         {printf("i saw a utf\n");}
 |               {printf("i didn't see anything.'\n");}
 ;

It always prints "i didn't see anything". My parser does work for ASCII files, that is, when I copy and paste the contents of the UTF-8 XML file into an empty document.

Any help would be appreciated.
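
Two details in the snippets above may be worth checking. In Lex, a named definition is only expanded when it is written in braces in the rules section, so a rule that starts with bare bom matches the literal text "bom", while {bom} refers to the definition. Also, [\xEF\xBB\xBF] is a character class that matches any single one of those three bytes, whereas the UTF-8 BOM is the three-byte sequence EF BB BF in that order. The C sketch below only illustrates the byte-level idea and is not a fix for the scanner itself; the helper names utf8_bom_length and utf8_seq_length are hypothetical and are not part of Lex or Yacc.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 3 if buf starts with the UTF-8 BOM
   (bytes EF BB BF, in that order), otherwise 0. */
size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    return (len >= 3 && memcmp(buf, bom, 3) == 0) ? 3 : 0;
}

/* Hypothetical helper: number of bytes in the UTF-8 sequence that
   begins with the given lead byte, or 0 if the byte cannot start a
   sequence (continuation bytes 80..BF, or bytes never valid in UTF-8). */
size_t utf8_seq_length(unsigned char lead)
{
    if (lead <= 0x7F) return 1;                 /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2; /* two-byte sequence */
    if (lead >= 0xE0 && lead <= 0xEF) return 3; /* three-byte sequence */
    if (lead >= 0xF0 && lead <= 0xF4) return 4; /* four-byte sequence */
    return 0;
}
```

One possible use, assuming the file is read into a buffer first: call utf8_bom_length on the first bytes and skip that many bytes before handing the rest of the input to the scanner.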
