Friday, April 20, 2012

Making Lex read UTF-8 doesn't work

I wrote an XML parser that parses ASCII files, but I now need it to read UTF-8 encoded files. I have the following regexes in Lex, but they don't match UTF-8 input, and I am not sure what I am doing wrong:



utf_8       [\x00-\xff]*
bom [\xEF\xBB\xBF]


then:



bom             { fprintf( stderr, "OMG I SAW A BOM"); return BOM;}
utf_8           { fprintf( stderr, "OMG I SAW A UTF CHAR: 0x%02X\n", (unsigned char) yytext[0] ); return UTF_8; }


I also have the following grammar rules:



program 
: UTF8 '<' '?'ID attribute_list '?''>'
root ...


where UTF8 is:



UTF8
: BOM           {printf("i saw a bom\n");}
| UTF_8         {printf("i saw a utf\n");}
|               {printf("i didn't see anything.'\n");}
;


It always prints "i didn't see anything". My parser does work for ASCII files, that is, when I copy and paste the contents of the UTF-8 XML file into an empty document.



Any help would be appreciated.
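
Two details of the Lex snippets in this question may be worth checking. This is a hedged reading, not a confirmed diagnosis. A bracket expression such as [\xEF\xBB\xBF] matches any single one of those three bytes, whereas the UTF-8 BOM is the three bytes EF BB BF in that exact order. Also, in Lex a named definition is only expanded inside a rule when it is written in braces, as in {bom}; a bare bom in the rules section matches the literal text "bom". The C sketch below illustrates only the first point. The helper names starts_with_utf8_bom and is_one_bom_byte are hypothetical and are not part of the parser in the question.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf begins with the UTF-8 BOM,
 * i.e. the bytes EF BB BF in that order, and 0 otherwise. */
int starts_with_utf8_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    return len >= 3 && memcmp(buf, bom, 3) == 0;
}

/* Hypothetical helper that mimics what a bracket expression like
 * [\xEF\xBB\xBF] does: it accepts ONE byte that is any member of the
 * set, which is not the same as matching the whole three-byte sequence. */
int is_one_bom_byte(unsigned char c)
{
    return c == 0xEF || c == 0xBB || c == 0xBF;
}
```

Calling starts_with_utf8_bom on the first bytes read from the file would report whether a BOM is present at all, which may help separate a problem in the input file from a problem in the scanner rules.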




