Friday, April 20, 2012

Making lex read UTF-8 doesn't work

I wrote an XML parser that parses ASCII files, but I now need to be able to read UTF-8 encoded files. I have the following regexes in lex, but they don't match UTF-8, and I am not sure what I am doing wrong:

utf_8       [\x00-\xff]*
bom [\xEF\xBB\xBF]


bom             { fprintf( stderr, "OMG I SAW A BOM"); return BOM;}
utf_8 { fprintf( stderr, "OMG I SAW A UTF CHAR", yytext[0] ); return UTF_8; }

I also have the following grammar rules:

: UTF8 '<' '?'ID attribute_list '?''>'
root ...

where UTF8 is:


UTF8
    : BOM           { printf("i saw a bom\n"); }
    | UTF_8         { printf("i saw a utf\n"); }
    |               { printf("i didn't see anything.'\n"); }

It always comes up with "i didn't see anything". My parser does work for ASCII files, that is, when I copy and paste the contents of the UTF-8 XML file into an empty document.

Any help would be appreciated.
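
A note on the bom definition above: in lex, a bracket expression such as [\xEF\xBB\xBF] matches a single byte that is any one of the three listed, whereas the UTF-8 byte order mark is the three bytes EF BB BF in that order. Lex definitions are also normally referenced in the rules section with braces, as in {bom}; a bare bom in a rule matches the literal text "bom". Whether either point is the cause here is an assumption, not something the post confirms. The C sketch below only illustrates the byte-level facts outside of lex. The helper names has_utf8_bom and utf8_seq_len are hypothetical and are not part of the original parser.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf starts with the UTF-8 BOM,
   i.e. the bytes EF BB BF in exactly that order, otherwise 0. */
int has_utf8_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    return len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
}

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
   with lead byte b, or 0 if b cannot start a sequence (it is either a
   continuation byte or a value that never appears in valid UTF-8). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;               /* 0xxxxxxx: plain ASCII      */
    if (b >= 0xC2 && b <= 0xDF) return 2; /* 110xxxxx: two-byte lead    */
    if (b >= 0xE0 && b <= 0xEF) return 3; /* 1110xxxx: three-byte lead  */
    if (b >= 0xF0 && b <= 0xF4) return 4; /* 11110xxx: four-byte lead   */
    return 0;                             /* 80-BF, C0, C1, F5-FF       */
}
```

With a check like has_utf8_bom, the BOM could be skipped before the input ever reaches the scanner, so the grammar would not need a BOM token at all. That is one possible design choice, not a requirement.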
