
Topic: Something odd (Read 969 times)

legendary
Activity: 1246
Merit: 1077
July 28, 2013, 09:23:56 AM
#16
The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
Really? In 2013?
Why not? HTML itself only uses plain ASCII characters, and HTML entities allow any other character to be represented in ASCII text. You could encode a Chinese-Klingon dictionary in ASCII using HTML entities if you really wanted to, though it would take a whopping 8 bytes per character.

Again, its character encoding doesn't support Unicode, but the forum does use Unicode. HTML entities are a form of Unicode encoding too.
legendary
Activity: 4536
Merit: 3188
Vile Vixen and Miss Bitcointalk 2021-2023
July 27, 2013, 11:34:47 PM
#15
The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
Really? In 2013?
Why not? HTML itself only uses plain ASCII characters, and HTML entities allow any other character to be represented in ASCII text. You could encode a Chinese-Klingon dictionary in ASCII using HTML entities if you really wanted to, though it would take a whopping 8 bytes per character.
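To put a number on the "8 bytes per character" figure, here is a minimal sketch using Python's built-in `xmlcharrefreplace` error handler. It only illustrates the idea of numeric character references; it is not how the forum software itself does the conversion, and the helper name `to_ascii_with_entities` is made up for this example.

```python
def to_ascii_with_entities(text: str) -> bytes:
    """Encode text as pure ASCII, replacing each non-ASCII character
    with a numeric character reference such as &#20013;."""
    return text.encode("ascii", errors="xmlcharrefreplace")


sample = "中"  # U+4E2D, a common CJK character

as_entities = to_ascii_with_entities(sample)
print(as_entities)                  # b'&#20013;'
print(len(as_entities))             # 8 bytes as an entity
print(len(sample.encode("utf-8")))  # 3 bytes in UTF-8

# Plain ASCII text passes through unchanged.
print(to_ascii_with_entities("abc"))  # b'abc'
```

Characters with shorter code points need fewer bytes (for example "©" becomes the 6-byte "&#169;"), so 8 is the typical cost for CJK text rather than a fixed cost for everything.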
legendary
Activity: 1302
Merit: 1007
July 27, 2013, 11:21:51 PM
#14
Only stuff that conflicts with HTML, such as "<" or sometimes JavaScript, needs to be translated. Normal characters should be no more than one byte each.
The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (e.g., "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "Â©". You've probably seen this before on sites that don't perform this conversion correctly.
Oh, I see. That's weird, nowadays quite a few websites use Unicode.
hero member
Activity: 784
Merit: 1000
0xFB0D8D1534241423
July 27, 2013, 11:16:47 PM
#13
The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
Really? In 2013?
Code:
legendary
Activity: 1400
Merit: 1013
July 27, 2013, 11:02:24 PM
#12
The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.
Really? In 2013?
legendary
Activity: 4536
Merit: 3188
Vile Vixen and Miss Bitcointalk 2021-2023
July 27, 2013, 10:49:33 PM
#11
Unicode should be UTF-8. Just a minor correction, as the forum does indeed use Unicode, but cannot encode most Unicode characters.
The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages. If it actually does store posts in UTF-8 (or any other encoding), it would have to perform the conversion every time a page is requested, which seems rather wasteful.
legendary
Activity: 1246
Merit: 1077
July 27, 2013, 10:35:49 PM
#10
Only stuff that conflicts with HTML, such as "<" or sometimes JavaScript, needs to be translated. Normal characters should be no more than one byte each.
The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (e.g., "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "Â©". You've probably seen this before on sites that don't perform this conversion correctly.

Unicode should be UTF-8. Just a minor correction, as the forum does indeed use Unicode, but cannot encode most Unicode characters.
legendary
Activity: 4536
Merit: 3188
Vile Vixen and Miss Bitcointalk 2021-2023
July 27, 2013, 10:32:13 PM
#9
Only stuff that conflicts with HTML, such as "<" or sometimes JavaScript, needs to be translated. Normal characters should be no more than one byte each.
The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (e.g., "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "Â©". You've probably seen this before on sites that don't perform this conversion correctly.
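The garbling described here can be reproduced in a few lines. This is a small sketch, assuming the text was saved as UTF-8 bytes and then read back as ISO-8859-1; the function names are invented for the example and have nothing to do with SMF's actual code.

```python
import html


def misread_utf8_as_latin1(text: str) -> str:
    """Encode as UTF-8, then wrongly decode those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def to_numeric_entity(text: str) -> str:
    """Replace non-ASCII characters with numeric character references."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print("©".encode("utf-8"))          # b'\xc2\xa9' (two bytes)
print(misread_utf8_as_latin1("©"))  # Â©
print(to_numeric_entity("©"))       # &#169;

# Both the named and the numeric entity decode back to the same character.
print(html.unescape("&copy;") == html.unescape("&#169;") == "©")  # True
```

The stray "Â" is simply the first UTF-8 byte (0xC2) shown as its own ISO-8859-1 character. The entity form avoids the problem because it is pure ASCII, which reads the same under either encoding.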
legendary
Activity: 1302
Merit: 1007
July 27, 2013, 02:45:05 PM
#8
SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.

Are numbers translated too? Because if so, it would seem they are translated back...

Anyways, I guess that's reasonable. Personally, I would have taken a storage hit and stored both a BBCode version in UTF-8 and a cached HTML translation. It would be most efficient, speed-wise (one translation per edit, rather than multiple), and storage is quite cheap (especially for text). IIRC that's what Wikipedia does, and it's a major reason why they can serve so many people so quickly with very few servers.
Only stuff that conflicts with HTML, such as "<" or sometimes JavaScript, needs to be translated. Normal characters should be no more than one byte each.
administrator
Activity: 5222
Merit: 13032
July 26, 2013, 07:43:24 PM
#7
No, numbers aren't translated.
legendary
Activity: 1246
Merit: 1077
July 26, 2013, 07:38:22 PM
#6
SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.

Are numbers translated too? Because if so, it would seem they are translated back...

Anyways, I guess that's reasonable. Personally, I would have taken a storage hit and stored both a BBCode version in UTF-8 and a cached HTML translation. It would be most efficient, speed-wise (one translation per edit, rather than multiple), and storage is quite cheap (especially for text). IIRC that's what Wikipedia does, and it's a major reason why they can serve so many people so quickly with very few servers.
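A rough sketch of the "store both" idea, to make the trade-off concrete. Everything here is hypothetical: `render_html` is a toy stand-in that only handles `[b]` tags and newlines, not a real BBCode parser, and the `Post` class is not modelled on SMF or MediaWiki.

```python
import html
from dataclasses import dataclass, field


def render_html(source: str) -> str:
    """Toy renderer (assumption): escape HTML, convert [b] tags and newlines."""
    out = html.escape(source, quote=False)
    out = out.replace("[b]", "<b>").replace("[/b]", "</b>")
    return out.replace("\n", "<br />")


@dataclass
class Post:
    source: str                                 # BBCode as typed, kept as Unicode text
    cached_html: str = field(init=False)        # rendered copy, stored next to the source
    render_count: int = field(default=0, init=False)

    def __post_init__(self) -> None:
        self._render()

    def _render(self) -> None:
        self.cached_html = render_html(self.source)
        self.render_count += 1

    def edit(self, new_source: str) -> None:
        """Rendering happens once per edit..."""
        self.source = new_source
        self._render()

    def view(self) -> str:
        """...and never on a page view."""
        return self.cached_html


post = Post("[b]hi[/b] & bye\nend")
for _ in range(1000):
    post.view()
print(post.view())        # <b>hi</b> &amp; bye<br />end
print(post.render_count)  # 1
post.edit("edited")
print(post.render_count)  # 2
```

The cost is storing each post twice; the gain is that the original text stays editable in its natural form while page views only read the cached copy.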
hero member
Activity: 784
Merit: 1000
0xFB0D8D1534241423
July 26, 2013, 07:18:26 PM
#5
There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines.

There should be some sort of warning if you trigger this.

This seems absolutely ridiculous. UTF-8 has more characters that can be fit into a byte, and all Unicode characters can be encoded in at most 6 bytes. I assume my post is so large then, because of all the numbers, punctuation, and newlines.

Thank you for inserting my post, though.
Remember, you're going into HTML. For example, an ampersand (one byte in UTF-8) must become "&amp;" (5 bytes) or "&#38;" (5 bytes). Newlines are encoded as "<br />", which is 6 bytes.
administrator
Activity: 5222
Merit: 13032
July 26, 2013, 07:16:21 PM
#4
SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.
legendary
Activity: 1246
Merit: 1077
July 26, 2013, 06:59:33 PM
#3
There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines.

There should be some sort of warning if you trigger this.

This seems absolutely ridiculous. UTF-8 has more characters that can be fit into a byte, and all Unicode characters can be encoded in at most 6 bytes. I assume my post is so large then, because of all the numbers, punctuation, and newlines.

Thank you for inserting my post, though.
administrator
Activity: 5222
Merit: 13032
July 26, 2013, 05:47:48 PM
#2
There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines.

There should be some sort of warning if you trigger this.
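As an illustration of what such a warning could check, here is a minimal sketch that estimates the stored size of a post before saving. It assumes the stored form is roughly "HTML-escape, turn newlines into `<br />`, replace non-ASCII characters with numeric entities". It uses Python's `html.escape` as an approximation, so the exact set of escaped characters will differ from SMF's, and all function names are invented for the example.

```python
import html

POST_LIMIT_BYTES = 65535  # the limit quoted in this thread


def stored_form(text: str) -> bytes:
    """Approximate the stored form of a post (assumption, not SMF's code)."""
    escaped = html.escape(text, quote=True)          # "&" -> "&amp;", "<" -> "&lt;", ...
    with_breaks = escaped.replace("\n", "<br />")    # each newline becomes 6 bytes
    return with_breaks.encode("ascii", errors="xmlcharrefreplace")


def stored_size(text: str) -> int:
    return len(stored_form(text))


def would_be_truncated(text: str, limit: int = POST_LIMIT_BYTES) -> bool:
    return stored_size(text) > limit


print(stored_size("&"))         # 5
print(stored_size("\n"))        # 6
print(stored_size("a & b\n"))   # 15, from 6 typed characters

# A post that looks well under the limit when counted in characters:
sample = "1\n" * 20000
print(len(sample))                 # 40000
print(stored_size(sample))         # 140000
print(would_be_truncated(sample))  # True
```

A check like this would compare the size of the converted text, not the typed text, against the limit, which is the gap a plain character count misses.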
legendary
Activity: 1246
Merit: 1077
July 26, 2013, 04:16:53 PM
#1
I was making some expansions to this thread recently. When I saved the post, it was cut off. However, the post is well under the 65535-character limit:

(firefox)
Code:
[17:16:06.713] post.length
[17:16:06.719] 62075

No notice came up; the post was just cut off. The post preview worked as expected.

Why, then, was the post cut off?