Romanize filenames

Keywords: UTF-8, romanize, cyrillic, latin, convert, filename

When upgrading from a previous version that did not yet have the “romanize” option, you will find that the directory structure is full of 'unreadable' filenames.

For example: %D0%BA%D1%8B%D1%80%D0%B3%D1%8B%D0%B7%D1%81%D1%82%D0%B0%D0%BD.txt is the same as кыргызстан.txt

This is because UTF-8 filenames have been urlencoded.
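
The example filename can be checked with PHP's built-in urldecode(). This is only a minimal sketch to show the relationship between the two names; rawurlencode() is used here for illustration and is not necessarily the exact call that older DokuWiki versions used when writing files.

```php
<?php
// The urlencoded filename from the example, as found on disk.
$encoded = '%D0%BA%D1%8B%D1%80%D0%B3%D1%8B%D0%B7%D1%81%D1%82%D0%B0%D0%BD.txt';

// Decoding yields the original UTF-8 page name.
$decoded = urldecode($encoded);
echo $decoded, "\n";

// Encoding the UTF-8 name again gives back the on-disk form
// (illustration only; the encoder DokuWiki used is an assumption).
$reencoded = rawurlencode($decoded);
echo $reencoded, "\n";
```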

In later versions, the “romanization” option was added to circumvent this problem. 1)

The script below will convert this unreadable directory structure to “romanized” filenames.

You will have to include the utf8.php file, which is part of the DokuWiki installation.

Please note: this script is not error free. For example, some Cyrillic characters will leave a ”'” at the end of the filename. After conversion, please check your page structure for invalid filenames.
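
One way to do that check is to scan the converted directory for names that still contain unexpected characters. The following is a rough sketch, not part of the original script: the function name findSuspectNames() is hypothetical, and the allowed character set (lowercase ASCII letters, digits, "_", "-" and ".") is an assumption that you may need to adjust to your configuration.

```php
<?php
/**
 * Return all paths below $root whose file or directory name contains
 * characters outside a conservative ASCII set.
 * Hypothetical helper; the allowed set is an assumption for this sketch.
 */
function findSuspectNames($root)
{
    $suspect = array();
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::SELF_FIRST // visit directories as well as files
    );
    foreach ($iterator as $item) {
        // Anything that is not a-z, 0-9, "_", "-" or "." marks the name as suspect
        if (!preg_match('/^[a-z0-9_.\-]+$/', $item->getFilename())) {
            $suspect[] = $item->getPathname();
        }
    }
    sort($suspect);
    return $suspect;
}
```

Calling findSuspectNames() on the new pages directory and printing each returned path gives a list of files to rename by hand.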

I hope this will help someone. Any improvements welcome.

 
<?php
 
include("utf8.php");
 
/**
 * Copy a file, or recursively copy a folder and its contents, cleaning up the filenames according to the DokuWiki UTF-8 romanization
 *
 * @original_author      Aidan Lister <aidan@php.net>
 * @link        http://aidanlister.com/repos/v/function.copyr.php
 * @param       string   $source    Source path
 * @param       string   $dest      Destination path
 * @return      bool     Returns TRUE on success, FALSE on failure
 */
function copyr($source, $dest)
{
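    // Romanized version of the destination path
    $dest2 = cleanID($dest);
    echo $source."->".$dest." ->$dest2<br/>";
    // Simple copy for a file
    if (is_file($source)) {
        return copy($source, $dest2);
    }
 
    // Make destination directory (check the romanized path, which is the one that gets created)
    if (!is_dir($dest2)) {
        mkdir($dest2);
    }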
 
    // Loop through the folder
    $dir = dir($source);
    while (false !== $entry = $dir->read()) {
        // Skip pointers
        if ($entry == '.' || $entry == '..') {
            continue;
        }
 
        // Recurse into files and subdirectories (skip the destination itself)
        if ($dest !== "$source/$entry") {
            copyr("$source/$entry", "$dest/$entry");
        }
    }
 
    // Clean up
    $dir->close();
    return true;
}
 
copyr("/dokuwiki/data/pages/","/dokuwiki/data/pagesnew/");
 
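// Reduced variant of DokuWiki's cleanID(): decode, lowercase, romanize, deaccent
function cleanID($id,$ascii=false){
  $id = trim(urldecode($id));
  $id = utf8_strtolower($id);
  $id = utf8_romanize($id);
  $id = utf8_deaccent($id,-1);          // utf8_deaccent() returns the result, so assign it
  $id = preg_replace('#\'+#','_',$id);  // replace apostrophes with underscores
  return($id);
}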
 
 
?>
1) see deaccent and romanization for more info