[SGVLUG] Grep "quickie" needed -- searching for hi-bit characters
Christopher Smith
x at xman.org
Fri Jan 4 20:18:53 PST 2008
Claude Felizardo wrote:
> On Jan 4, 2008 3:57 PM, Emerson, Tom (*IC) <Tom.Emerson at wbconsultant.com> wrote:
>
>>
>> I've got an odd one here -- I know how I'd do this on an HP using some
>> proprietary tools I've used for the last 15 years, but this is on a *nix
>> system so I need to know how to do this using grep.
>>
>> We have some files that were transferred from one machine to another [one of
>> which was a PC], and somewhere in the process, it appears that some
>> local-language/"multi-byte" characters got translated to
>> multiple-ascii-bytes, which in turn buggered up the record length.
>> Fortunately, these are easy to detect visually as the new values for each
>> "byte" of the character are between 128 and 255 and generally look like
>> "line noise" when cat'd to the screen. Unfortunately, the files involved
>> are thousands of lines long, so a pure visual search is out of the question.
>>
>> What would I use as a regex to find characters with a byte (ascii) value >
>> 127?
>>
>
> sounds like you should be using sed or perl.
> can't think of the regex right now but if it's supposed to be regular
> text, what about just running the files through strings?
>
This is simple enough to do in C, let alone perl:
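#include <inttypes.h>   /* uint64_t and PRIu64 */
#include <stdio.h>
#include <sys/types.h>  /* off_t */

int main(void)
{
    off_t offset = 0;
    int byte;

    /* getchar() returns each byte as a value in 0..255, or EOF at end of input */
    while (EOF != (byte = getchar())) {
        if (byte > 127) {
            printf("Offset: %" PRIu64 "\t Character value: %d\n",
                   (uint64_t) offset, byte);
        }
        ++offset;
    }
    return 0;
}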
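
Since the real complaint is a broken record length, it may be handier to know which line and column each stray byte sits on rather than its raw offset. Here's a rough sketch of that variant. It is not anything from the original question: report_high_bytes is a name I made up, and it assumes newline-terminated records with 1-based columns counted in bytes.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (illustration only): scan `in` and write one line to
 * `out` for every byte above 127, giving its line and byte column.
 * Returns the number of such bytes found. */
unsigned long report_high_bytes(FILE *in, FILE *out)
{
    unsigned long line = 1, column = 0, found = 0;
    int byte;

    assert(in != NULL && out != NULL);

    /* getc() yields each byte as 0..255, or EOF at end of input */
    while ((byte = getc(in)) != EOF) {
        ++column;
        if (byte > 127) {
            fprintf(out, "Line %lu, column %lu: byte value %d\n",
                    line, column, byte);
            ++found;
        }
        if (byte == '\n') {   /* the next byte starts a new record */
            ++line;
            column = 0;
        }
    }
    return found;
}
```

Calling report_high_bytes(stdin, stdout) from main turns it into the same kind of filter as the program above, and the return value tells you whether the file was clean.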
--Chris