Something odd | Bitcointalksearch.org

dree12

legendary

Activity: 1246

Merit: 1077

Quote from: Foxpup on July 27, 2013, 11:34:47 PM

Quote from: justusranvier on July 27, 2013, 11:02:24 PM

Quote from: Foxpup on July 27, 2013, 10:49:33 PM

The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.

Really? In 2013?

Why not? HTML itself only uses plain ASCII characters, and HTML entities allow any other character to be represented in ASCII text. You could encode a Chinese-Klingon dictionary in ASCII using HTML entities if you really wanted to, though it would take a whopping 8 bytes per character.

Again, its character encoding doesn't support Unicode, but the forum does use Unicode. HTML entities are a form of Unicode encoding too.

Foxpup

legendary

Activity: 4542

Merit: 3393

Vile Vixen and Miss Bitcointalk 2021-2023

Quote from: justusranvier on July 27, 2013, 11:02:24 PM

Quote from: Foxpup on July 27, 2013, 10:49:33 PM

The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.

Really? In 2013?

Why not? HTML itself only uses plain ASCII characters, and HTML entities allow any other character to be represented in ASCII text. You could encode a Chinese-Klingon dictionary in ASCII using HTML entities if you really wanted to, though it would take a whopping 8 bytes per character.
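To put a number on the "8 bytes per character" remark, here is a small sketch that compares the size of a character written as a decimal numeric entity with its size in UTF-8. The helper names `entityBytes` and `utf8Bytes` are made up for illustration, and decimal entities are assumed (hex or named entities would give different lengths).

```javascript
// Size, in ASCII bytes, of a character written as a decimal numeric entity.
function entityBytes(ch) {
  return `&#${ch.codePointAt(0)};`.length;
}

// Size of the same character when encoded as UTF-8.
function utf8Bytes(ch) {
  return new TextEncoder().encode(ch).length;
}

console.log(entityBytes("©"), utf8Bytes("©"));   // 6 2  ("&#169;")
console.log(entityBytes("中"), utf8Bytes("中")); // 8 3  ("&#20013;")
```

Code points from 10000 to 99999 need five digits, which is where the 8 bytes come from; anything from 100000 upward needs 9.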

btcton

legendary

Activity: 1302

Merit: 1007

Quote from: Foxpup on July 27, 2013, 10:32:13 PM

Quote from: btcton on July 27, 2013, 02:45:05 PM

Only stuff that conflicts with HTML such as "<" or sometimes JavaScript need to be translated. Normal characters should be no more than one byte each.

The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (eg, "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "Â©". You've probably seen this before on sites that don't perform this conversion correctly.

Oh, I see. That's weird, nowadays quite a few websites use Unicode.

nimda

hero member

Activity: 784

Merit: 1000

0xFB0D8D1534241423

Quote from: justusranvier on July 27, 2013, 11:02:24 PM

Quote from: Foxpup on July 27, 2013, 10:49:33 PM

The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.

Really? In 2013?

Code:

justusranvier

legendary

Activity: 1400

Merit: 1013

Quote from: Foxpup on July 27, 2013, 10:49:33 PM

The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages.

Really? In 2013?

Foxpup

legendary

Activity: 4542

Merit: 3393

Vile Vixen and Miss Bitcointalk 2021-2023

Quote from: dree12 on July 27, 2013, 10:35:49 PM

Unicode should be UTF-8. Just a minor correction, as the forum does indeed use Unicode, but cannot encode most Unicode characters.

The forum does not use UTF-8, or any other flavour of Unicode. It uses ISO-8859-1, or at least, that's how it serves its pages. If it actually does store posts in UTF-8 (or any other encoding), it would have to perform the conversion every time a page is requested, which seems rather wasteful.

dree12

legendary

Activity: 1246

Merit: 1077

Quote from: Foxpup on July 27, 2013, 10:32:13 PM

Quote from: btcton on July 27, 2013, 02:45:05 PM

Only stuff that conflicts with HTML such as "<" or sometimes JavaScript need to be translated. Normal characters should be no more than one byte each.

The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (eg, "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "Â©". You've probably seen this before on sites that don't perform this conversion correctly.

Unicode should be UTF-8. Just a minor correction, as the forum does indeed use Unicode, but cannot encode most Unicode characters.

Foxpup

legendary

Activity: 4542

Merit: 3393

Vile Vixen and Miss Bitcointalk 2021-2023

Quote from: btcton on July 27, 2013, 02:45:05 PM

Only stuff that conflicts with HTML such as "<" or sometimes JavaScript need to be translated. Normal characters should be no more than one byte each.

The forum doesn't use Unicode. All non-ASCII characters must be converted to the corresponding HTML entity (eg, "©" becomes "&copy;" or "&#169;") in order to be displayed correctly. Without conversion, "©" will actually be displayed as "Â©". You've probably seen this before on sites that don't perform this conversion correctly.
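Both effects can be reproduced in a few lines. This is only a sketch: `decodeLatin1` and `toNumericEntities` are illustrative names, not anything the forum software actually uses. The first shows what happens when UTF-8 bytes are read as ISO-8859-1, the second does the entity conversion using decimal numeric references.

```javascript
// Decode bytes as ISO-8859-1: each byte maps to the code point of the same value.
function decodeLatin1(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Replace every non-ASCII code point with a decimal numeric character reference.
function toNumericEntities(text) {
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0);
    return cp > 0x7f ? `&#${cp};` : ch;
  }).join("");
}

const utf8Bytes = new TextEncoder().encode("©"); // two bytes: 0xC2 0xA9
console.log(decodeLatin1(utf8Bytes));            // "Â©"
console.log(toNumericEntities("© 2013"));        // "&#169; 2013"
```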

btcton

legendary

Activity: 1302

Merit: 1007

Quote from: dree12 on July 26, 2013, 07:38:22 PM

Quote from: theymos on July 26, 2013, 07:16:21 PM

SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.

Are numbers translated too? Because if so, it would seem they are translated back...

Anyways, I guess that's reasonable. Personally, I would have taken a storage hit and stored both a BBCode version in UTF-8 and a cached HTML translation. It would be most efficient, speed-wise (one translation per edit, rather than multiple), and storage is quite cheap (especially for text). IIRC that's what Wikipedia does, and it's a major reason why they can serve so many people so quickly with very few servers.

Only stuff that conflicts with HTML such as "<" or sometimes JavaScript need to be translated. Normal characters should be no more than one byte each.

theymos

administrator

Activity: 5222

Merit: 13032

No, numbers aren't translated.

dree12

legendary

Activity: 1246

Merit: 1077

Quote from: theymos on July 26, 2013, 07:16:21 PM

SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.

Are numbers translated too? Because if so, it would seem they are translated back...

Anyways, I guess that's reasonable. Personally, I would have taken a storage hit and stored both a BBCode version in UTF-8 and a cached HTML translation. It would be most efficient, speed-wise (one translation per edit, rather than multiple), and storage is quite cheap (especially for text). IIRC that's what Wikipedia does, and it's a major reason why they can serve so many people so quickly with very few servers.
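A rough sketch of the store-both idea, to show where the "one translation per edit" saving comes from. Everything here is hypothetical: `PostStore` is an invented name, and `renderBBCode` is a toy stand-in that only escapes HTML and handles [b] tags and newlines. It is not how SMF or Wikipedia actually work.

```javascript
// Toy renderer standing in for a real BBCode-to-HTML translator (assumption).
let renderCount = 0;
function renderBBCode(source) {
  renderCount += 1;
  return source
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/\[b\](.*?)\[\/b\]/g, "<b>$1</b>")
    .replace(/\n/g, "<br />");
}

class PostStore {
  constructor() {
    this.posts = new Map();
  }
  // Translate once per save or edit, and keep both forms.
  save(id, source) {
    this.posts.set(id, { source, html: renderBBCode(source) });
  }
  // Page views return the cached HTML without re-rendering.
  view(id) {
    return this.posts.get(id).html;
  }
  // Editing starts from the original source text.
  sourceOf(id) {
    return this.posts.get(id).source;
  }
}

const store = new PostStore();
store.save(1, "[b]hi[/b] & bye\n");
store.view(1);
store.view(1);
store.view(1);
console.log(renderCount); // 1: three views, one translation
```

The cost is roughly double the storage per post, which is the trade-off described above.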

nimda

hero member

Activity: 784

Merit: 1000

0xFB0D8D1534241423

Quote from: dree12 on July 26, 2013, 06:59:33 PM

Quote from: theymos on July 26, 2013, 05:47:48 PM

There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines.

There should be some sort of warning if you trigger this.

This seems absolutely ridiculous. UTF-8 has more characters that can be fit into a byte, and all Unicode characters can be encoded in at most 6 bytes. I assume my post is so large then, because of all the numbers, punctuation, and newlines.

Thank you for inserting my post, though.

Remember, you're going into HTML. For example, an ampersand (one byte in UTF-8) must become "&amp;" (5 bytes) or "&#38;" (5 bytes). Newlines are encoded as "<br />" which is 6 bytes.

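For illustration, a minimal sketch of this kind of escaping and how it inflates a short string. `escapeForStorage` is a hypothetical helper, not SMF's actual code, and it assumes only `&`, `<`, `>`, `"` and newlines are translated.

```javascript
// Hypothetical helper, not SMF's actual code: escapes a few HTML-special
// characters and turns newlines into "<br />".
const ENTITIES = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" };

function escapeForStorage(text) {
  return text
    .replace(/[&<>"]/g, (ch) => ENTITIES[ch]) // escape specials first
    .replace(/\r\n|\r|\n/g, "<br />");        // then newlines, so "<br />" is not re-escaped
}

const input = "a & b\nc";
const stored = escapeForStorage(input);
console.log(stored);                      // a &amp; b<br />c
console.log(input.length, stored.length); // 7 16
```
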
theymos

administrator

Activity: 5222

Merit: 13032

SMF translates all special characters into HTML entities and all newlines into <br />s before inserting text into the database. This is maybe more efficient, though I probably wouldn't have done it this way.

dree12

legendary

Activity: 1246

Merit: 1077

Quote from: theymos on July 26, 2013, 05:47:48 PM

There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines.

There should be some sort of warning if you trigger this.

This seems absolutely ridiculous. UTF-8 has more characters that can be fit into a byte, and all Unicode characters can be encoded in at most 6 bytes. I assume my post is so large then, because of all the numbers, punctuation, and newlines.

Thank you for inserting my post, though.

theymos

administrator

Activity: 5222

Merit: 13032

There's a 65535-byte limit. Characters not in [a-zA-Z ] require ~6 bytes with SMF's encoding, including newlines.

There should be some sort of warning if you trigger this.
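As a rough way to check a draft before posting, the rule of thumb above can be turned into an estimate. This is a sketch only: `estimateStoredBytes` and `wouldBeTruncated` are made-up names, and the flat 6-byte cost is an assumption taken from the "~6 bytes" figure. Digits are reportedly not translated (see later in the thread), so the result leans high.

```javascript
// Rule of thumb: characters in [a-zA-Z ] cost 1 byte, everything else is
// assumed to cost about 6 bytes. The real cost depends on the entity emitted,
// so treat the result as an estimate, not an exact size.
const LIMIT_BYTES = 65535;

function estimateStoredBytes(text) {
  let total = 0;
  for (const ch of text) { // iterates by code point
    total += /^[a-zA-Z ]$/.test(ch) ? 1 : 6;
  }
  return total;
}

function wouldBeTruncated(text) {
  return estimateStoredBytes(text) > LIMIT_BYTES;
}

// 6000 lines of nine letters plus a newline: 60000 characters in the browser.
const post = "abcdefghi\n".repeat(6000);
console.log(post.length);               // 60000
console.log(estimateStoredBytes(post)); // 90000
console.log(wouldBeTruncated(post));    // true
```

So a post can show a `post.length` well under 65535 and still blow past the limit once it is encoded.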

dree12

legendary

Activity: 1246

Merit: 1077

I was making some expansions to this thread recently. When I saved the post, it was cut off. However, the post is well under the 65535-character limit:

(firefox)

Code:

[17:16:06.713] post.length
[17:16:06.719] 62075

No notice came up; the post was just cut off. The post preview worked as expected.

Why, then, was the post cut off?

Topic: Something odd (Read 971 times)