File Oddity

February 13th, 2009

Today at work I was attempting to parse a file and discovered something odd happening. When I simply viewed the file with cat, I could see this:

<html><head><title>Status</title></head>
<table>
<tr><td>Failed</td><td>Backup Group</td></tr>
<tr><td>Success</td><td>Another Backup Group</td></tr>
</table>
</body>
</html>

Nothing odd there; the file looks normal. But when I tried this command:

$ grep -i failed status.html
$

Huh? No output, suggesting that there are no lines containing the word "failed". The same happened with awk and sed; indeed, I could not find any tool that could pull out the status. So the next step was to check what was odd about the file:

$ file status.html
status.html: HTML document text

Still nothing unusual. By now I was in full head-scratching mode. I opened the file in vim to see if I could discover anything strange about it, but found nothing. At that point I happened to switch from cat to more, and the result was the start of how I solved it.

$ more status.html
��<

$
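Those unprintable characters ahead of the < are consistent with a byte order mark (BOM) at the start of the file. As a rough sketch, the first two bytes can be checked directly. The helper name bom_of and the file sample.html are made up for illustration, and the sample is generated here rather than taken from the real status.html:

```shell
# Hypothetical helper: report which UTF-16 byte order mark, if any, starts a file.
bom_of() {
    # Read the first two bytes as hex and strip the whitespace od adds.
    case "$(od -An -tx1 -N2 "$1" | tr -d ' \n')" in
        fffe) echo "UTF-16LE BOM" ;;
        feff) echo "UTF-16BE BOM" ;;
        *)    echo "no UTF-16 BOM" ;;
    esac
}

# Demo on a generated sample: BOM bytes FF FE, then "<" encoded as 3C 00.
printf '\377\376<\000' > sample.html
bom_of sample.html
```

This only looks at two bytes, so it is a quick hint rather than real encoding detection.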

Ah, so now I can see why grep and the others cannot find anything in the file. So I switched back to vim and checked the file encoding:

:set fileencoding
fileencoding=ucs-2le

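For those who are unaware, this is UCS-2 (little endian), the fixed-width predecessor of UTF-16; for text like this the two are byte-for-byte identical. Every character takes two bytes, so each ASCII letter is followed by a NUL byte and the contiguous byte sequence "failed" never appears in the file, which is why grep came up empty. Now for the trick to get around it: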
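To make the byte layout concrete, here is a minimal sketch that encodes a single word as UTF-16LE and dumps it. The input word is made up for illustration; nothing here touches the original status.html.

```shell
# Encode one word as UTF-16LE (no BOM) and dump the raw bytes in hex.
# Each ASCII letter comes out as its usual byte followed by a 00 byte.
printf 'Failed\n' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1

# A byte-wise search for "failed" finds no match, because the letters are
# never adjacent in the encoded stream. grep -c prints the match count.
printf 'Failed\n' | iconv -f UTF-8 -t UTF-16LE | grep -ic failed || true
```

The dump starts 46 00 61 00 69 00, that is "F", "a", "i", each padded with a zero byte.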

$ iconv -f UTF-16 -t UTF-8 status.html | grep -i failed
<tr><td>Failed</td><td>Backup Group</td></tr>
$
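If the file needs more than one pass, it may be simpler to convert it once and point the usual tools at the copy. The sketch below rebuilds a stand-in for the file first; the names status-utf16.html and status-utf8.html are assumptions, not files from the original setup.

```shell
# Recreate a stand-in for the problem file: a UTF-16LE BOM followed by two table rows.
printf '\377\376' > status-utf16.html
printf '%s\n' \
    '<tr><td>Failed</td><td>Backup Group</td></tr>' \
    '<tr><td>Success</td><td>Another Backup Group</td></tr>' \
    | iconv -f UTF-8 -t UTF-16LE >> status-utf16.html

# Convert once; with -f UTF-16, iconv uses the BOM to pick the byte order.
iconv -f UTF-16 -t UTF-8 status-utf16.html > status-utf8.html

# Ordinary text tools now work on the copy.
grep -i failed status-utf8.html

# Split each line on <td> and </td>; the group name is the fourth field.
awk -F'</?td>' '/Failed/ { print $4 }' status-utf8.html
```

Keeping the original and writing to a new name avoids clobbering the source if the conversion fails partway.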

Tada. Once more a solution!
