newspaint

Documenting Problems That Were Difficult To Find The Answer To

Converting Russian Encodings to UTF-8 Using Perl

I want to watch TV shows with Russian subtitles. VLC (the video player) on my PC defaults to interpreting subtitle files (e.g. .srt files) as UTF-8.

Some of my subtitle files are not in UTF-8. Sometimes they are encoded as Windows-1251 and in one case they were in UTF-16BE. Rather than guess, I’d like a script to try each likely encoding and write any non-UTF-8 file out as UTF-8.

The following script does this for every file named on the command line. It must be saved as UTF-8 because it contains literal Cyrillic characters:

#!/usr/bin/perl -w

use utf8; # tell Perl this source code file is utf-8

my @files = ();
foreach ( @ARGV ) {
    push( @files, glob( $_ ) ); # necessary for Windows
}

my @encodings = qw(windows-1251 utf-32 utf-16 utf-16-le utf-16-be); # utf-8 is not listed: such files need no conversion
sub process_file {
    my ( $fname ) = @_;
	
	foreach my $encoding ( @encodings ) {
	    my $fin;
		if ( ! open( $fin, "<", $fname ) ) { # three-argument open: safe for odd file names
		    warn( "Could not open \"$fname\": $!" );
			last;
		}
		
		if ( ! binmode $fin, ":encoding($encoding)" ) {
		    warn( "Could not set encoding \"$encoding\"" );
			next;
		}
		
		my $data = "";
		eval {
			local $/ = undef;
			$data = <$fin>;
		};
		
		# assume the Russian word for "yes" is somewhere in the file
		next if ( $data !~ m{[Дд][Аа]} );

		print( "  - found encoding $encoding\n" );
		my $fout;
		if ( ! open( $fout, ">", "$fname.new" ) ) {
		    warn( "Could not write to $fname.new: $!" );
			last;
		}
			
		if ( ! binmode $fout, ":encoding(utf-8)" ) {
		    warn( "Could not set encoding utf-8 for output" );
			last;
		}
			
		print( $fout $data );
		close( $fout );
		last;
	}
}

foreach ( @files ) {
	print( "Processing file \"$_\"\n" ); # print, not printf: a "%" in a file name must not be read as a format
	process_file( $_ );
}

Alternatively, use the following script, which writes the Cyrillic letters as code point escapes and so does not need to be saved as UTF-8:

#!/usr/bin/perl -w

my @files = ();
foreach ( @ARGV ) {
    push( @files, glob( $_ ) ); # necessary for Windows
}

my @encodings = qw(windows-1251 utf-32 utf-16 utf-16-le utf-16-be); # utf-8 is not listed: such files need no conversion
sub process_file {
    my ( $fname ) = @_;
	
	foreach my $encoding ( @encodings ) {
	    my $fin;
		if ( ! open( $fin, "<", $fname ) ) { # three-argument open: safe for odd file names
		    warn( "Could not open \"$fname\": $!" );
			last;
		}
		
		if ( ! binmode $fin, ":encoding($encoding)" ) {
		    warn( "Could not set encoding \"$encoding\"" );
			next;
		}
		
		my $data = "";
		eval {
			local $/ = undef;
			$data = <$fin>;
		};
		
		# assume the Russian word for "yes" is somewhere in the file
		next if ( $data !~ m{[\x{414}\x{434}][\x{410}\x{430}]} );

		print( "  - found encoding $encoding\n" );
		my $fout;
		if ( ! open( $fout, ">", "$fname.new" ) ) {
		    warn( "Could not write to $fname.new: $!" );
			last;
		}
			
		if ( ! binmode $fout, ":encoding(utf-8)" ) {
		    warn( "Could not set encoding utf-8 for output" );
			last;
		}
			
		print( $fout $data );
		close( $fout );
		last;
	}
}

foreach ( @files ) {
	print( "Processing file \"$_\"\n" ); # print, not printf: a "%" in a file name must not be read as a format
	process_file( $_ );
}
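
To see why searching for "да" picks out the right encoding, here is a minimal sketch that decodes the same bytes under several candidate encodings and tests each result. The sample line and the helper name looks_russian are assumptions made up for illustration, and it uses the Encode module directly on a byte string rather than reading a file through an :encoding layer as the scripts above do:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample line: "- да, конечно" written with code point
# escapes, so this file does not need to be saved as UTF-8.
my $text = "- \x{434}\x{430}, \x{43a}\x{43e}\x{43d}\x{435}\x{447}\x{43d}\x{43e}\n";

# Pretend the subtitle file was saved as Windows-1251.
my $bytes = encode( "windows-1251", $text );

# Hypothetical helper: decode the bytes with the given encoding and
# report whether the result contains "да" in any letter case.
sub looks_russian {
    my ( $octets, $encoding ) = @_;
    my $decoded = eval { decode( $encoding, $octets ) };
    return 0 unless defined $decoded;
    return $decoded =~ m{[\x{414}\x{434}][\x{410}\x{430}]} ? 1 : 0;
}

foreach my $encoding ( qw(windows-1251 UTF-16LE UTF-16BE) ) {
    printf( "%-12s %s\n", $encoding,
        looks_russian( $bytes, $encoding ) ? "match" : "no match" );
}
```

Only the encoding the bytes were really written in should yield recognisable Cyrillic; decoding them with the wrong encoding produces unrelated characters, so the pattern does not match.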

Save either script as russian-to-utf8.pl and run it using the command:

C:\>perl russian-to-utf8.pl *.srt 2>NUL

or

user@myhost# perl russian-to-utf8.pl *.srt 2>/dev/null

If a file is detected as Russian text in a non-UTF-8 encoding, a UTF-8 copy is written alongside it with the extension .new appended to the original file name.
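
Once you are happy with the converted files you may want them to take the place of the originals. The following is a rough sketch of one way to do that, not part of the scripts above: the helper name promote_converted and the .bak backup extension are my own assumptions. It keeps each original as a .bak file and moves the .new file into its place:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep the original as "<name>.bak" and move
# "<name>.new" into its place. Returns 1 on success, 0 otherwise.
sub promote_converted {
    my ( $fname ) = @_;

    # No .new file means the file was not converted.
    return 0 unless -e "$fname.new";

    # Note: on Windows rename may fail if "<name>.bak" already exists.
    if ( ! rename( $fname, "$fname.bak" ) ) {
        warn( "Could not back up \"$fname\": $!" );
        return 0;
    }
    if ( ! rename( "$fname.new", $fname ) ) {
        warn( "Could not move \"$fname.new\" into place: $!" );
        return 0;
    }
    return 1;
}

# Expand wildcards ourselves, as the conversion scripts do for Windows.
foreach my $fname ( map { glob( $_ ) } @ARGV ) {
    print( "Replaced \"$fname\"\n" ) if promote_converted( $fname );
}
```

Run it with the same arguments as the conversion script, for example *.srt. Files that have no matching .new file are left untouched.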
