
How do I get Perl to read the values of my HTML form as Unicode?

 2 years ago
source link: https://www.codesd.com/item/how-do-i-get-perl-to-read-the-values-of-my-html-form-as-unicode.html

I have an HTML form that sends data to a .cgi page. Here is the HTML:

<HTML>

<BODY BGCOLOR="#FFFFFF">

    <FORM METHOD="post" ACTION="test.cgi">

        <B>Write to me below:</B><P>
        <TEXTAREA NAME="feedback" ROWS=10 COLS=50></TEXTAREA><P>

        <CENTER>
            <INPUT TYPE=submit VALUE="SEND">
            <INPUT TYPE=reset VALUE="CLEAR">
        </CENTER>

    </FORM>

</BODY>
</HTML>

Here is the Perl script for test.cgi:

#!/usr/bin/perl

use utf8;
use encoding('utf8');
require Encode;
require CGI;

# The following accepts the data from the form and puts it in %FORM

if ($ENV{'REQUEST_METHOD'} eq 'POST') {
    read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});

    @pairs = split(/&/, $buffer);

    foreach $pair (@pairs) {
        ($name, $value) = split(/=/, $pair);
        $value =~ tr/+/ /;
        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

    $FORM{$name} = $value;
    }

# The following generates the html for the page

    print "Content-type: text/html\n\n";
    print "<HTML>\n";
    print "<HEAD>\n";
    print "<TITLE>Thank You!</TITLE>\n";
    print "</HEAD>\n";
    print "<BODY BGCOLOR=#FFFFCC TEXT=#000000>\n";
    print "<H1>Thank You!</H1>\n";
    print "<P>\n";
    print "<H3>Your feedback is greatly appreciated.</h3><BR>\n";
    print "<P>\n<P>\n";
    print "The user wrote:\n\n";
    print "<P>\n";

# This is print statement A
    print "$FORM{'feedback'}<br>\n";

    $FORM{'feedback'}=~s/(\w)/ $1/g;

# This is print statement B
    print "$FORM{'feedback'}\n";

    print "</BODY>\n";
    print "</HTML>\n";
    exit(0);
}

This all works as intended when the user enters English text. However, it will eventually be used in a product where users enter Chinese text, and that is where the problem shows up. If the user enters "中文" into the form, print statement A prints "中文". Print statement B, which prints $FORM{'feedback'} after the regex has run, prints " 2 0 0 1 3; 2 5 9 9 1; ". What I want it to print is "中 文". To see this for yourself, go to http://thedeandp.com/chinese/input.html and try it.
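A quick check on the output quoted above, offered as a sketch rather than a confirmed diagnosis: the digits 20013 and 25991 match the decimal Unicode code points of 中 and 文. That pattern is what one would see if the browser had submitted numeric character references (&#20013;&#25991;) instead of UTF-8 bytes, which browsers do when the form page's declared encoding cannot represent a character. Whether that is what happened here is an assumption; the variable names below are illustrative only.

```perl
use strict;
use warnings;

# The two characters, written as escapes so the snippet does not depend on
# the encoding of this source file.
my @code_points = map { ord($_) } ("\x{4E2D}", "\x{6587}");   # 中, 文

# Prints: 20013 25991
print "@code_points\n";

# ASSUMPTION: the browser sent numeric character references, not UTF-8 bytes.
my $submitted = '&#20013;&#25991;';

# The same substitution the script applies to the feedback field.
(my $spaced = $submitted) =~ s/(\w)/ $1/g;

# Prints: &# 2 0 0 1 3;&# 2 5 9 9 1;
print "$spaced\n";
```

If this is the cause, serving the form page itself as UTF-8 would be part of the fix, in addition to decoding the input on the Perl side.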

As far as I can tell, when Perl reads in the form it treats each byte as a character, so the regex adds a space between each byte. Chinese characters take multiple bytes each in a Unicode encoding such as UTF-8, so the regex splits each character apart by putting spaces between its bytes, and that produces the output seen in print statement B. I've tried things like $value = Encode::decode_utf8($value) to get Perl to treat the input as Unicode, but nothing has worked so far.
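The byte-versus-character distinction described above can be sketched with the core Encode module. This assumes the form data really does arrive as UTF-8 bytes; the helper name space_out and the sample bytes are illustrative, not taken from the original script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Put a space before every word character, like the regex in the script.
sub space_out {
    my ($text) = @_;
    $text =~ s/(\w)/ $1/g;
    return $text;
}

# Raw UTF-8 bytes for "中文", as they would arrive in a UTF-8 form post.
my $octets = "\xE4\xB8\xAD\xE6\x96\x87";

# Decode the bytes into characters before running any regex on them.
my $text = decode('UTF-8', $octets);

# Prints: 6 bytes, 2 characters
printf "%d bytes, %d characters\n", length($octets), length($text);

# Encode back to bytes only when writing the result out.
print encode('UTF-8', space_out($text)), "\n";
```

Once the value is decoded, \w matches each Chinese character as a single unit, so the spaces land between characters rather than inside them. Note that the regex as written also puts a space before the first character.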


That CGI style could be improved while fixing your encoding/decoding issue. Try this:

use strict;
use warnings;
use Encode;
use CGI ":standard";
use HTML::Entities;

print
    header("text/html; charset=utf-8"),
    start_html("Thank you!"),
    h1("Thank You!"),
    h3("Your feedback is greatly appreciated.");

if ( my $feedback = decode_utf8( param("feedback") ) )
{
    print
        p("The user wrote:"),
        blockquote( encode_utf8( encode_entities($feedback) ) );
}

print end_html();

Proper encoding and decoding between octets (bytes) and UTF-8 text is necessary to avoid surprises and to let Perl behave as you'd expect.

For example, you can drop this in:

    h4("Which capitalizes as:"),
    blockquote( encode_utf8( encode_entities( uc $feedback ) ) );

And see character conversions work like so: å™ç∂®r£ ➟ Å™Ç∂®R£
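A minimal sketch of why the capitalization works once the input is decoded, using only the core Encode module. The sample bytes stand for "å™r" as UTF-8, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for "å™r": å is U+00E5, ™ is U+2122.
my $octets = "\xC3\xA5\xE2\x84\xA2r";

# After decoding, uc works on characters and follows Unicode case rules.
my $text  = decode('UTF-8', $octets);
my $upper = uc $text;

# Encode again for output; this writes the UTF-8 bytes for "Å™R".
print encode('UTF-8', $upper), "\n";
```

The trademark sign has no uppercase form, so it passes through unchanged, matching the example above.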

Update: added encode_entities. NEVER print user input back without escaping the HTML. Update to the update: depending on how it is called, encode_entities will also end up escaping the UTF-8 characters themselves (you can have it escape only ['"<>], for example)…
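To illustrate escaping only the HTML-significant characters while leaving the Unicode text alone, here is a hand-rolled sketch. The helper escape_html and its lookup table are hypothetical and exist only for this example; with HTML::Entities, passing a second argument such as q{<>&"'} to encode_entities restricts escaping in a similar way.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical lookup table: only the characters that are special in HTML.
my %ENTITY_FOR = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

# Escape markup characters; every other character passes through untouched.
sub escape_html {
    my ($text) = @_;
    $text =~ s/([&<>"'])/$ENTITY_FOR{$1}/g;
    return $text;
}

# Decoded sample input containing both markup and Chinese characters.
my $input   = qq{<b>"\x{4E2D}\x{6587}" & more</b>};
my $escaped = escape_html($input);

# Encode to UTF-8 bytes only at the point of output.
print encode('UTF-8', $escaped), "\n";
```

The markup is neutralized, while 中文 stays as literal characters and is sent as UTF-8 rather than as numeric entities.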

