Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


php - How to use Imagick annotateImage for Chinese text?

I need to annotate an image with Chinese text, and I am currently using the Imagick library.

An example of the Chinese text is

这是中文

The Chinese font file I use was originally named 华文黑体.ttf.

It can also be found on Mac OS X under /Library/Fonts.

I have renamed it to the English name STHeiTi.ttf to make it easier to reference the file in PHP code.

In particular, I am using the Imagick::annotateImage function.

I also am using the answer from "How can I draw wrapped text using Imagick in PHP?".

I am using it because it works for English text, and the application needs to annotate both English and Chinese, though not at the same time.

The problem is that when I run annotateImage with Chinese text, I get an annotation that looks like 罍

Code included here



1 Reply


The problem is that you are feeding ImageMagick the output of a "line splitter" (wordWrapAnnotation) whose text input you first pass through utf8_decode. That is certainly wrong for Chinese text: utf8_decode can only handle UTF-8 text that CAN be converted to ISO-8859-1 (the most common 8-bit extension of ASCII).
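To make the data loss concrete, here is a minimal sketch. The helper name decodeToLatin1 is mine and does not appear in the question's code. Note that utf8_decode is deprecated as of PHP 8.2, which is why the call is silenced with @ here.

```php
<?php
// Illustration only: what utf8_decode() does to text that has no
// ISO-8859-1 representation. utf8_decode() is deprecated since PHP 8.2;
// the @ operator silences that notice for this demo.
function decodeToLatin1(string $utf8Text): string
{
    return @utf8_decode($utf8Text);
}

$latin   = decodeToLatin1('café');      // 'é' exists in ISO-8859-1 and survives as one byte
$chinese = decodeToLatin1('这是中文');   // no Han character exists in ISO-8859-1

var_dump(strlen($latin)); // 4 bytes: c, a, f, é
var_dump($chinese);       // each Chinese character has been replaced by '?'
```

Once the characters have been replaced, nothing downstream (the line splitter, the font, ImageMagick) can recover them.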

Now, I hope that your text is UTF-8 encoded. If it is not, you might be able to convert it like this:

$text = mb_convert_encoding($text, 'UTF-8', 'BIG-5');

or like this

$text = mb_convert_encoding($text, 'UTF-8', 'GB18030'); // only PHP >= 5.4.0

(in your code, $text is actually $text1 and $text2).
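If you are not sure whether the input is already UTF-8, the conversion can be guarded by a validity check. The following is only a sketch: ensureUtf8 is a hypothetical helper, and it assumes you know which legacy encoding the text would otherwise be in. A byte sequence in a legacy encoding can occasionally also be valid UTF-8, so the check is a heuristic, not a guarantee.

```php
<?php
// Hypothetical helper (not part of the question's code): return $text as
// UTF-8, converting from $sourceEncoding only when it is not valid UTF-8.
function ensureUtf8(string $text, string $sourceEncoding): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
}

// Simulate legacy input by encoding a UTF-8 literal as GB18030 first.
$legacy = mb_convert_encoding('这是中文', 'GB18030', 'UTF-8');
$text1  = ensureUtf8($legacy, 'GB18030');

var_dump($text1 === '这是中文');
```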

Then there are (at least) two things to fix in your code:

  1. pass the text "as is" (without utf8_decode) to wordWrapAnnotation,
  2. change the argument of setTextEncoding from "utf-8" to "UTF-8" as per specs

I hope that all variables in your code are initialized in some part of it that is not shown. With the two changes above (the second one might not be necessary, but you never know...), and with the missing parts in place, I see no reason why your code should not work, unless your TTF file is broken or the Imagick library is broken (ImageMagick, on which Imagick is based, is a great library, so I consider this last possibility rather unlikely).
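Since the original code is not shown, here is a minimal sketch of what the annotation step could look like with both fixes applied. The function name annotateUtf8, the canvas size, the colours, the coordinates and the output file name are all assumptions made for illustration. Only STHeiTi.ttf comes from the question.

```php
<?php
// Sketch only: assumes the Imagick extension is loaded and that $fontFile
// points to a TTF containing Chinese glyphs (e.g. the renamed STHeiTi.ttf
// from the question). Canvas size, colours and coordinates are arbitrary.
function annotateUtf8(string $text, string $fontFile, string $outFile): void
{
    $image = new Imagick();
    $image->newImage(400, 100, new ImagickPixel('white'));
    $image->setImageFormat('png');

    $draw = new ImagickDraw();
    $draw->setFont($fontFile);
    $draw->setFontSize(32);
    $draw->setFillColor(new ImagickPixel('black'));
    $draw->setTextEncoding('UTF-8'); // fix 2: upper-case encoding name

    // Fix 1: the text is passed as-is, with no utf8_decode() anywhere.
    $image->annotateImage($draw, 10, 50, 0, $text);
    $image->writeImage($outFile);
}

// Guarded usage, so the script does nothing where Imagick or the font is missing.
if (extension_loaded('imagick') && is_file('STHeiTi.ttf')) {
    annotateUtf8('这是中文', 'STHeiTi.ttf', 'out.png');
}
```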

EDIT:

Following your request, I have updated my answer with

a) the fact that setting mb_internal_encoding('utf-8') is very important for the solution, as you say in your answer, and

b) my proposal for a better line splitter, that works acceptably for western languages and for Chinese, and that is probably a good starting point for other languages using Han logograms (Japanese kanji and Korean hanja):

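function wordWrapAnnotation(&$image, &$draw, $text, $maxWidth)
{
   // Split at spaces, before a Han character unless it follows an opening
   // quote (Pi) or opening bracket (Ps), and before Pi/Ps characters.
   $regex = '/( |(?=\p{Han})(?<!\p{Pi})(?<!\p{Ps})|(?=\p{Pi})|(?=\p{Ps}))/u';
   // Collapse whitespace runs; the u modifier keeps \v from matching the
   // byte 0x85 inside multi-byte UTF-8 characters.
   $cleanText = trim(preg_replace('/[\s\v]+/u', ' ', $text));
   $strArr = preg_split($regex, $cleanText, -1, PREG_SPLIT_DELIM_CAPTURE |
                                                PREG_SPLIT_NO_EMPTY);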
   $linesArr = array();
   $lineHeight = 0;
   $goodLine = '';
   $spacePending = false;
   foreach ($strArr as $str) {
      if ($str == ' ') {
         $spacePending = true;
      } else {
         if ($spacePending) {
            $spacePending = false;
            $line = $goodLine.' '.$str;
         } else {
            $line = $goodLine.$str;
         }
         $metrics = $image->queryFontMetrics($draw, $line);
         if ($metrics['textWidth'] > $maxWidth) {
            if ($goodLine != '') {
               $linesArr[] = $goodLine;
            }
            $goodLine = $str;
         } else {
            $goodLine = $line;
         }
         if ($metrics['textHeight'] > $lineHeight) {
            $lineHeight = $metrics['textHeight'];
         }
      }
   }
   if ($goodLine != '') {
      $linesArr[] = $goodLine;
   }
   return array($linesArr, $lineHeight);
}
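To see what the split step produces, the pattern can be tried on its own, without Imagick. This is a small sketch. splitForWrapping is a hypothetical wrapper name, and the pattern is the one from the function above with the \p{...} escapes written out.

```php
<?php
// Same split pattern as in wordWrapAnnotation, with the \p{..} escapes written out.
$regex = '/( |(?=\p{Han})(?<!\p{Pi})(?<!\p{Ps})|(?=\p{Pi})|(?=\p{Ps}))/u';

// Hypothetical wrapper so the split step can be inspected in isolation.
function splitForWrapping(string $text, string $regex): array
{
    return preg_split($regex, $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

// Mixed English and Chinese input with a parenthesised part.
print_r(splitForWrapping('Hello 这是(中文)', $regex));
```

In the result, the English word and the space come out as separate chunks, each Chinese character becomes its own chunk, and the opening parenthesis stays attached to the character that follows it, so a line can never end with a dangling "(".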

In words: the input is first cleaned up by removing leading and trailing whitespace and replacing every other run of whitespace, including newlines, with a single space. It is then split at spaces, right before Han characters that are not preceded by "leading" characters (such as opening parentheses or opening quotes), and right before "leading" characters. Lines are assembled so that each renders no wider than $maxWidth pixels, except when the splitting rules make this impossible (in which case the final rendering will probably overflow). Modifying the function to force a split in overflow cases is not difficult. Note that Chinese punctuation, for example, is not classified as Han in Unicode, so the algorithm never inserts a line break before it, except before "leading" punctuation.
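One possible shape for the forced split in overflow cases is sketched below. It is not part of the answer's function: forceSplit and its $measure callback are hypothetical names, and the width function in the usage lines is a stand-in (10 units per character) so the sketch runs without Imagick. With Imagick, $measure could instead return $image->queryFontMetrics($draw, $s)['textWidth'].

```php
<?php
// Hypothetical helper: break a single over-wide chunk into pieces that each
// fit within $maxWidth. $measure is any callable that returns the rendered
// width of a string.
function forceSplit(string $chunk, float $maxWidth, callable $measure): array
{
    $pieces  = [];
    $current = '';
    // Split into Unicode characters, not bytes.
    foreach (preg_split('//u', $chunk, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if ($current !== '' && $measure($current . $char) > $maxWidth) {
            $pieces[] = $current; // adding $char would overflow: close the piece
            $current  = $char;
        } else {
            $current .= $char;
        }
    }
    if ($current !== '') {
        $pieces[] = $current;
    }
    return $pieces;
}

// Stand-in width function for the demo: 10 units per character.
$measure = function (string $s): float {
    return 10.0 * preg_match_all('/./us', $s);
};

print_r(forceSplit('abcdefg', 30, $measure));
print_r(forceSplit('这是中文', 20, $measure));
```

Inside wordWrapAnnotation, such a helper would be applied to a chunk that is wider than $maxWidth on its own, before it is assigned to $goodLine. A single character wider than $maxWidth would still overflow.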

