Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


json - Handling Unicode sequences in postgresql

I have some JSON data stored in a JSON (not JSONB) column in my PostgreSQL database (9.4.1). Some of these JSON structures contain Unicode escape sequences in their attribute values. For example:

{"client_id": 1, "device_name": "FooBar\ufffd\u0000\ufffd\u000f\ufffd" }

When I try to query this JSON column (even if I'm not directly trying to access the device_name attribute), I get the following error:

ERROR: unsupported Unicode escape sequence
Detail: \u0000 cannot be converted to text.

You can recreate this error by executing the following command on a PostgreSQL server:

select '{"client_id": 1, "device_name": "FooBar\ufffd\u0000\ufffd\u000f\ufffd" }'::json->>'client_id';

The error makes sense to me: there is simply no way to represent the Unicode NULL character (U+0000) in a textual result.

Is there any way for me to query the same JSON data without having to sanitize the incoming data? These JSON structures change regularly, so checking one specific attribute (device_name in this case) would not be a good solution: other attributes could easily hold similar data.


After some more investigation, it seems that this behavior is new in version 9.4.1, as mentioned in the changelog:

...Therefore \u0000 will now also be rejected in json values when conversion to de-escaped form is required. This change does not break the ability to store \u0000 in json columns so long as no processing is done on the values...

Was this really the intention? Is a downgrade to pre 9.4.1 a viable option here?


As a side note, this property is taken from the name of the client's mobile device - it's the user that entered this text into the device. How on earth did a user insert NULL and REPLACEMENT CHARACTER values?!

question from:https://stackoverflow.com/questions/31671634/handling-unicode-sequences-in-postgresql


1 Reply


\u0000 is the one Unicode code point which is not valid in a PostgreSQL text string. I see no other way than to sanitize the string.

Since a json value is just a string in a specific format, you can cast it to text and use the standard string functions, without worrying about the JSON structure. A one-line sanitizer to remove the code point would be:

SELECT (regexp_replace(the_string::text, '\\u0000', '', 'g'))::json;

Instead of the empty replacement string, you can also substitute any character of your liking, which would be useful if the zero code point is used as some form of delimiter.
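
As a rough illustration of the sanitizer applied to the sample document from the question, the sketch below strips the escape and then queries the result. It assumes standard_conforming_strings is on (the default in recent versions), so the pattern '\\u0000' reaches the regex engine as an escaped backslash followed by the literal text u0000. The table name devices and column name payload are hypothetical, chosen only for this example.

```sql
-- Assumption: standard_conforming_strings = on, so backslashes inside
-- the quoted literals below are passed through unchanged.

-- 1. Strip the \u0000 escape from the sample document, then query it.
SELECT regexp_replace(
         '{"client_id": 1, "device_name": "FooBar\ufffd\u0000\ufffd\u000f\ufffd" }',
         '\\u0000',  -- regex: a literal backslash followed by "u0000"
         '',
         'g'
       )::json ->> 'client_id';

-- 2. The same idea against a json column.
--    "devices" and "payload" are hypothetical names.
SELECT regexp_replace(payload::text, '\\u0000', '', 'g')::json ->> 'device_name'
FROM devices;

-- 3. Substitute a visible placeholder instead of deleting the escape.
SELECT regexp_replace(payload::text, '\\u0000', '?', 'g')::json ->> 'device_name'
FROM devices;
```

One caveat with a plain text replacement: it does not understand JSON escaping, so a value containing an escaped backslash followed by u0000 (written \\u0000 in the JSON text) would also be altered and could leave invalid JSON behind. If such values are possible, a stricter pattern is needed.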

Note also the subtle difference between what is stored in the database and how it is presented to the user. You can store the \u0000 escape in a json column, but you have to replace it with some other character before processing the value as a json data type.
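
The distinction between storing and processing can be seen with two statements. This is a minimal sketch that uses a made-up one-attribute document and again assumes standard_conforming_strings is on.

```sql
-- Storing works: the json type keeps the input text as-is,
-- and \u0000 is syntactically valid JSON.
SELECT '{"device_name": "a\u0000b"}'::json;

-- Processing fails: extracting the value as text requires the
-- de-escaped form, which raises
-- "unsupported Unicode escape sequence".
SELECT '{"device_name": "a\u0000b"}'::json ->> 'device_name';
```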

