Migrating from an EUC environment to a UTF-8 environment

Converting the character encoding of text data

To carry EUC assets over to a UTF-8 environment, you can use the character encoding conversion tools nkf and iconv.
Both convert the contents of text files.

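When output is redirected in the shell, both nkf and iconv have to write to a separate file first. To keep the original file name, save the result to a separate file and then mv it over the original. (nkf can also rewrite files directly with the --overwrite or --in-place options listed in the help output below.)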
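
The save-then-rename step can be wrapped in a small helper. The following is a minimal sketch, not something either tool provides: the function name `convert_in_place` is hypothetical, and it assumes the input really is EUC-JP and that `iconv` and `mktemp` are available.

```shell
#!/bin/sh
# Hypothetical helper: convert FILE from EUC-JP to UTF-8 while keeping its name.
convert_in_place() {
    file=$1
    # Create the temporary file in the same directory so that mv is a plain
    # rename on the same filesystem.
    tmp=$(mktemp "$(dirname -- "$file")/conv.XXXXXX") || return 1
    if iconv -f EUC-JP -t UTF-8 "$file" > "$tmp"; then
        mv -- "$tmp" "$file"      # replaces the original; no separate rm needed
    else
        rm -f -- "$tmp"           # conversion failed: leave the original untouched
        return 1
    fi
}

# Usage: convert_in_place in-file.txt
```

Because the original is only replaced after iconv succeeds, a failed conversion leaves it intact. Note that the converted file takes over the temporary file's permissions (mktemp creates it readable and writable by the owner only), so adjust them with chmod if needed.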

The nkf command

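To convert kanji encodings or line endings, first transfer the files to the server with SCP or SFTP, then convert them with the nkf command or a similar tool.
The relevant nkf options are listed below.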

       -s     output MS-kanji (shifted-JIS) code.
       -e     output EUC (AT&T) code.
       -L[wmu] new line mode
                  -Lu   unix (LF)
                  -Lw   windows (CRLF)
                  -Lm   mac (CR)

Example: convert the source file (in-file.txt) to EUC, convert its line endings to Unix style at the same time, and save the result to out-file.txt.

    nkf -e -Lu in-file.txt > out-file.txt
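
For the EUC-to-UTF-8 direction that this page is about, the output flag would be -w rather than -e. The following is a minimal sketch that uses only flags shown in the nkf help output on this page; the wrapper name `euc_to_utf8` is hypothetical, and it assumes the input file really is EUC-JP.

```shell
#!/bin/sh
# Hypothetical wrapper: convert an EUC-JP file to UTF-8 with Unix line endings.
#   -E   treat the input as EUC-JP instead of relying on auto-detection
#   -w   write UTF-8
#   -Lu  convert line endings to LF
euc_to_utf8() {
    nkf -E -w -Lu "$1" > "$2"
}

# Usage: euc_to_utf8 in-file.txt out-file.txt
```

Stating the input encoding explicitly with -E avoids a wrong guess on short or ambiguous files.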

Checking a file's character encoding with nkf

The nkf option -g ( --guess ) reports the character encoding of a file.

[hat@node0 ~]$ ls 日本語*
日本語のファイル.txt  日本語のファイル1.txt  日本語のファイル2.txt  日本語のファイル3.txt

[hat@node0 ~]$ nkf -g 日本語*
日本語のファイル.txt:UTF-8
日本語のファイル1.txt:Shift_JIS
日本語のファイル2.txt:EUC-JP
日本語のファイル3.txt:UTF-8
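
The guess can also be used to pick out only the files that still need converting. The following is a rough sketch; the function name `list_euc_files` is hypothetical, and the result is only as reliable as nkf's guess, which is a heuristic and can be wrong for short files.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the given files that nkf guesses
# to be EUC-JP. Each file is fed on stdin, so the guess carries no file name
# prefix and can be matched directly.
list_euc_files() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        case $(nkf -g < "$f") in
            EUC-JP*) printf '%s\n' "$f" ;;
        esac
    done
}

# Usage: list_euc_files *.txt
```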

NKF HELP

[hat@node0 ~]$ nkf --help
USAGE:  nkf(nkf32,wnkf,nkf2) -[flags] [in file] .. [out file for -O flag]
Flags:
b,u      Output is buffered (DEFAULT),Output is unbuffered
j,s,e,w  Outout code is JIS 7 bit (DEFAULT), Shift JIS, EUC-JP, UTF-8N
         After 'w' you can add more options. -w[ 8 [0], 16 [[BL] [0]] ]
J,S,E,W  Input assumption is JIS 7 bit , Shift JIS, EUC-JP, UTF-8
         After 'W' you can add more options. -W[ 8, 16 [BL] ]
t        no conversion
i[@B]    Specify the Esc Seq for JIS X 0208-1978/83 (DEFAULT B)
o[BJH]   Specify the Esc Seq for ASCII/Roman        (DEFAULT B)
r        {de/en}crypt ROT13/47
h        1 katakana->hiragana, 2 hiragana->katakana, 3 both
v        Show this usage. V: show version
m[BQN0]  MIME decode [B:base64,Q:quoted,N:non-strict,0:no decode]
M[BQ]    MIME encode [B:base64 Q:quoted]
l        ISO8859-1 (Latin-1) support
f/F      Folding: -f60 or -f or -f60-10 (fold margin 10) F preserve nl
Z[0-3]   Convert X0208 alphabet to ASCII
         1: Kankaku to 1 space  2: to 2 spaces  3: Convert to HTML Entity
X,x      Assume X0201 kana in MS-Kanji, -x preserves X0201
B[0-2]   Broken input  0: missing ESC,1: any X on ESC-[($]-X,2: ASCII on NL
O        Output to File (DEFAULT 'nkf.out')
I        Convert non ISO-2022-JP charactor to GETA
d,c      Convert line breaks  -d: LF  -c: CRLF
-L[uwm]  line mode u:LF w:CRLF m:CR (DEFAULT noconversion)

Long name options
 --ic=<input codeset>  --oc=<output codeset>
                   Specify the input or output codeset
 --fj  --unix --mac  --windows
 --jis  --euc  --sjis  --utf8  --utf16  --mime  --base64
                   Convert for the system or code
 --hiragana  --katakana  --katakana-hiragana
                   To Hiragana/Katakana Conversion
 --prefix=         Insert escape before troublesome characters of Shift_JIS
 --cap-input, --url-input  Convert hex after ':' or '%'
 --numchar-input   Convert Unicode Character Reference
 --fb-{skip, html, xml, perl, java, subchar}
                   Specify how nkf handles unassigned characters
 --in-place[=SUFFIX]  --overwrite[=SUFFIX]
                   Overwrite original listed files by filtered result
                   --overwrite preserves timestamp of original files
 -g  --guess       Guess the input code
 --help  --version Show this help/the version
                   For more information, see also man nkf

Network Kanji Filter Version 2.0.7 (2006-06-13)
Copyright (C) 1987, FUJITSU LTD. (I.Ichikawa),2000 S. Kono, COW
Copyright (C) 2002-2006 Kono, Furukawa, Naruse, mastodon

The iconv command

Like nkf, the iconv command converts character encodings. However, whereas nkf detects the encoding of the input file automatically, iconv does not: you have to specify it with -f.
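
Since iconv does no detection, one workaround is to try candidate encodings and see which one decodes without error. The following is a rough sketch under that assumption; the function name `probe_encoding` and the candidate list are hypothetical, and the approach is a heuristic rather than real detection, because some byte sequences are valid in more than one encoding (an ASCII-only file, for example, is reported as UTF-8).

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate encoding in which FILE
# decodes without error. This is a heuristic, not real detection.
probe_encoding() {
    file=$1
    for enc in UTF-8 EUC-JP SHIFT_JIS; do
        # iconv exits with a non-zero status when the input is not valid
        # in the encoding given with -f.
        if iconv -f "$enc" -t UTF-8 "$file" > /dev/null 2>&1; then
            printf '%s\n' "$enc"
            return 0
        fi
    done
    return 1    # none of the candidates decoded cleanly
}

# Usage: probe_encoding in-file.txt
```

UTF-8 is tried first because its strict byte structure makes false positives less likely than with the other two candidates.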

[hat@node0 ~]$ iconv --help
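Usage: iconv [OPTION...] [FILE...]
Convert encoding of given files from one encoding to another.

 Input/Output format specification:
  -f, --from-code=NAME       encoding of original text
  -t, --to-code=NAME         encoding for output

 Information:
  -l, --list                 list all known coded character sets

 Output control:
  -c                         omit invalid characters from output
  -o, --output=FILE          output file
  -s, --silent               suppress warnings
      --verbose              print progress information

  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.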

For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

Converting file names

File contents can be converted with nkf or iconv, but converting the character encoding of file names calls for a different tool, convmv.
It is written in Perl.
http://j3e.de/linux/convmv/

Installation
```
$ tar zxvf convmv-1.08.tar.gz
$ cd convmv-1.08
$ make
$ su

# make install
```
- On CentOS and similar systems: yum install convmv
- On Ubuntu and similar systems: apt-get install convmv

Usage (converting from EUC to UTF-8)

1. Move to the directory that contains the files to convert
```
$ cd target
```
2. Check that the file names will be converted correctly (dry run)
```
$ convmv -r -f euc-jp -t utf8 *
```
3. Convert the file names
```
$ convmv --notest -r -f euc-jp -t utf8 *
```

Options

--notest...actually perform the rename

-r...search directories recursively

-f...encoding to convert from

-t...encoding to convert to
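
To make clear what these options amount to, the core of the operation can be imitated for a single file with iconv and mv. The following is an illustrative sketch only, not how convmv is implemented: the function name `rename_to_utf8` is hypothetical, it does not recurse, and unlike convmv it does not detect names that are already UTF-8.

```shell
#!/bin/sh
# Hypothetical helper: rename one file whose name is encoded in FROM_ENC so
# that the name becomes UTF-8. Like convmv, it only prints the planned
# rename unless --notest is given as the third argument.
rename_to_utf8() {
    from_enc=$1
    path=$2
    mode=${3:-}
    dir=$(dirname -- "$path")
    old=$(basename -- "$path")
    # Convert only the name; stop if it is not valid in FROM_ENC.
    new=$(printf '%s' "$old" | iconv -f "$from_enc" -t UTF-8) || {
        echo "not valid $from_enc: $path" >&2
        return 1
    }
    if [ "$new" = "$old" ]; then
        return 0                      # nothing to change (e.g. ASCII-only name)
    fi
    if [ -e "$dir/$new" ]; then
        echo "$dir/$new exists - skipped" >&2
        return 0
    fi
    if [ "$mode" = "--notest" ]; then
        mv -- "$path" "$dir/$new"
    else
        printf 'mv "%s" "%s"\n' "$path" "$dir/$new"
    fi
}

# Usage: rename_to_utf8 euc-jp ./some-file.txt            (dry run)
#        rename_to_utf8 euc-jp ./some-file.txt --notest   (rename)
```

For real migrations convmv remains the better choice, since it handles whole directory trees and skips names that already look like UTF-8.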

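Examples

If any file name is not in the specified encoding, convmv prints that file name and renames nothing. Convert such files individually, specifying the appropriate options for each.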

[hat@node0 ~]$ convmv -i -f euc-jp -t utf8 *.txt
Starting a dry run without changes...
this file was not validly encoded in euc-jp: "./���{���̃t�@�C��.txt"
To prevent damage to your files, we won't continue.
First fix this or correct options!

Check the files.

[hat@node0 ~]$ ls
???ܸ??Υե?????.txt     ???{???̃t?@?C??.txt      日本語のファイル3.txt

???{???̃t?@?C??.txt turned out not to be euc-jp, so try sjis.

[hat@node0 ~]$ convmv  -f sjis -t utf8 ???{???̃t?@?C??.txt
Starting a dry run without changes...
mv "./���{���̃t�@�C��.txt"      "./日本語のファイル.txt"
No changes to your files done. Use --notest to finally rename the files.

It looks like this file can be converted.

The other file must be euc-jp, since the first run reported only one file as not being euc-jp.

[hat@node0 ~]$ convmv  -f euc-jp -t utf8 ???ܸ??Υե?????.txt
Starting a dry run without changes...
mv "./���ܸ��Υե�����.txt"        "./日本語のファイル.txt"
No changes to your files done. Use --notest to finally rename the files.

As expected, it was euc-jp. It looks like this file can be converted too.

In interactive mode (-i), convmv asks for confirmation for each file.

[hat@node0 ~]$ convmv -i -f euc-jp -t utf8 *.txt
Starting a dry run without changes...
mv "./���ܸ��Υե�����.txt"        "./日本語のファイル.txt" (y/n) n
Skipping, already UTF-8: ./日本語のファイル3.txt
No changes to your files done. Use --notest to finally rename the files.

Without the --notest option, convmv only shows how the file names would change.

[hat@node0 ~]$ convmv  -f euc-jp -t utf8 *.txt
Starting a dry run without changes...
mv "./���ܸ��Υե�����.txt"        "./日本語のファイル.txt"
Skipping, already UTF-8: ./日本語のファイル3.txt
No changes to your files done. Use --notest to finally rename the files.

Now convert the Shift_JIS file for real by adding the --notest option.

[hat@node0 ~]$ convmv -f sjis -t utf8 ???{???̃t?@?C??.txt --notest
mv "./??{??̃t?@?C??.txt" "./日本語のファイル.txt"
Ready!

The file was renamed to "./日本語のファイル.txt".

Now convert the other, EUC-JP file as well. (A file with the same name already exists.)

[hat@node0 ~]$ convmv -f euc-jp -t utf8 ???ܸ??Υե?????.txt --notest
mv "./???ܸ?Υե?????.txt" "./日本語のファイル.txt"
日本語のファイル.txt exists and differs or --replace option missing - skipped
Ready!

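If a file with the target name already exists, the rename is skipped. (According to the help text below, the --replace option replaces the existing file only if the two files are equal.)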

CONVMV HELP

Your Perl version has fleas #22111 #37757 
convmv 1.14 - converts filenames from one encoding to another
Copyright (C) 2003-2008 Bjoern JACKE <bjoern@j3e.de>

This program comes with ABSOLUTELY NO WARRANTY; it may be copied or modified
under the terms of the GNU General Public License version 2 or 3 as published
by the Free Software Foundation.

 USAGE: convmv [options] FILE(S)
-f enc     encoding *from* which should be converted
-t enc     encoding *to* which should be converted
-r         recursively go through directories
-i         interactive mode (ask for each action)
--nfc      target files will be normalization form C for UTF-8 (Linux etc.)
--nfd      target files will be normalization form D for UTF-8 (OS X etc.)
--qfrom    be quiet about the "from" of a rename (if it screws up your terminal e.g.)
--qto      be quiet about the "to" of a rename (if it screws up your terminal e.g.)
--exec c   execute command instead of rename (use #1 and #2 and see man page)
--list     list all available encodings
--lowmem   keep memory footprint low (see man page)
--nosmart  ignore if files already seem to be UTF-8 and convert if posible
--notest   actually do rename the files
--replace  will replace files if they are equal
--unescape convert%20ugly%20escape%20sequences
--upper    turn to upper case
--lower    turn to lower case
--parsable write a parsable todo list (see man page)
--help     print this help
