
Topic: Decoupling message IDs from topic IDs (SMF patch) (Read 276 times)

legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
September 01, 2023, 05:21:10 AM
#12
The idea would be, when the person makes a quote, the forum will give a maximum line limit, then generating a scroll if there are more lines than this limit.
I was thinking of a manual option, like code-tags, but automating to all posts is even better.
legendary
Activity: 1638
Merit: 4508
**In BTC since 2013**
Ratimov used code-tags to limit the size of the data, but it would be much better if the BBCode would still be processed. Currently it's not possible to add a scroll bar to a quote.

The idea would be, when the person makes a quote, the forum will give a maximum line limit, then generating a scroll if there are more lines than this limit. And that?

It's really not a bad idea, taking into account that some users don't use quotes correctly and then we have huge texts, when 3 or 4 quoted lines would be enough.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
legendary
Activity: 1638
Merit: 4508
**In BTC since 2013**
Congratulations PowerGlove for another useful and quickly implemented patch.
Personally, I don't even know how to thank you.


Without a doubt, it is a patch that will be very useful for those who like to collect information on the forum.
Happy scraping to everyone who likes to do them.


By the way, just one question (which I almost certainly have the answer to, but to confirm), the numbering of the posts is sequential, correct?
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
administrator
Activity: 5166
Merit: 12850
Added, thanks!

It would be better if message-links didn't have the topic in them, but I'm worried that changing the links in [quote]s etc. would break something, so this will only be used by people in-the-know about it, not in any forum-generated HTML.

(I've also been meaning to write a very simple API which would allow bots to function more efficiently in general, but I haven't gotten to it yet.)
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Interestingly, if you misuse this feature, and try to build a complete scraper from it, then it returning multiple posts per request actually helps efficiency. What you would do is start from post 1 (/index.php?topic=*.msg1) and then parse all the posts that come back (if any), being careful to (conceptually) knock their message IDs off the list before advancing to the next post. In practice, that'll probably amount to keeping track of already-visited messages with some kind of data structure that allows for doing lookups in sublinear time (like a map, or a set).
Been there, done that :D But it makes much more sense to use topicIDs instead of msgIDs to scrape everything.
hero member
Activity: 510
Merit: 4005
Good job yet again PowerGlove.
Thanks, man. ;)

This would make a huge difference for current and future features on multiple projects of mine, thank you for coding it, @PowerGlove.
No problem, it's cool to hear that it'll make a difference to your projects. I enjoy messing around with SMF, but it means a lot to me when my stuff ends up making a positive difference (of course, I'm at the mercy of theymos' preferences and schedule, but I think there's a pretty good chance that this one will get merged).

I guess an asterisk was easier to patch than isolating just one post?
Yup, this patch is basically just one if statement with a 4-line body, placed in just the right spot. I could make it more complicated and isolate a single post, but I don't think that's worth the complication.

When this feature is being used for its intended purpose (helping scrapers to fill in gaps in their archives, and gracefully recover from DDoS attacks, etc.), I don't think efficiency is a major factor; like you said, you only have a few thousand posts missing from your archive.

Interestingly, if you misuse this feature, and try to build a complete scraper from it, then it returning multiple posts per request actually helps efficiency. What you would do is start from post 1 (/index.php?topic=*.msg1) and then parse all the posts that come back (if any), being careful to (conceptually) knock their message IDs off the list before advancing to the next post. In practice, that'll probably amount to keeping track of already-visited messages with some kind of data structure that allows for doing lookups in sublinear time (like a map, or a set).
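A minimal Python sketch of that visited-set idea, not anyone's actual scraper. `fetch_page` is a hypothetical callable standing in for the HTTP request and HTML parsing, and `make_fake_fetch` is an offline stand-in that pretends every request returns all message IDs of the matching topic (the real forum returns at most one page of posts per request).

```python
def crawl(fetch_page, max_id):
    """Visit IDs 1..max_id, skipping any ID already seen on an earlier page.

    fetch_page is a hypothetical callable: given a message ID, it requests
    /index.php?topic=*.msg{id} and returns the message IDs found on that
    page (an empty list if the message does not exist).
    """
    visited = set()      # set membership checks are O(1) on average
    requests_made = 0
    for msg_id in range(1, max_id + 1):
        if msg_id in visited:
            continue     # already collected from an earlier page
        requests_made += 1
        visited.update(fetch_page(msg_id))
    return visited, requests_made


def make_fake_fetch(topics):
    """Offline stand-in for the forum: each topic is a list of message IDs,
    and asking for any message returns every ID in its topic."""
    by_message = {m: ids for ids in topics for m in ids}
    return lambda msg_id: by_message.get(msg_id, [])
```

With two fake topics holding messages `[1, 2, 4]` and `[3, 5]`, crawling IDs 1 to 6 collects all five messages while skipping the requests for 2, 4 and 5.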

One question: will the resulting page also show the topicID somewhere?
Yup, the asterisk is (mostly) transparent to the rest of SMF, so the page will render just as if you had supplied the correct topic ID.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
I guess an asterisk was easier to patch than isolating just one post? It does mean the server has to look up (up to) 20 posts for each request, but I guess that's tiny compared to the rest of the page loads.
One question: will the resulting page also show the topicID somewhere? If so, I can work with this and add the (couple of thousand) missing posts to my archive. Thanks for writing this!

Please, theymos… ;D
^What he says^
legendary
Activity: 2758
Merit: 6830
This would make a huge difference for current and future features on multiple projects of mine, thank you for coding it, @PowerGlove.

Please, theymos… ;D

I'm not exactly certain what the patch does even after reading the thread twice
Good job yet again PowerGlove.
Sometimes my archive is missing a post of id XXXX and I can’t scrape it because I have no idea on which topic the post is. With this change, I can find it without knowing the topic id.
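As a rough illustration of that gap-filling workflow, here is a short Python sketch. The function names are hypothetical, and it assumes the archive can provide the set of message IDs it already holds. It finds the missing ID ranges and turns them into wildcard links. Some of those IDs may simply not exist on the forum, so an empty result for a given link is expected.

```python
def missing_ranges(archived_ids, first_id, last_id):
    """Return inclusive (start, end) ranges of IDs in [first_id, last_id]
    that are absent from archived_ids."""
    have = set(archived_ids)
    ranges = []
    start = None
    for msg_id in range(first_id, last_id + 1):
        if msg_id not in have:
            if start is None:
                start = msg_id          # a new gap begins here
        elif start is not None:
            ranges.append((start, msg_id - 1))  # the gap ended just before msg_id
            start = None
    if start is not None:
        ranges.append((start, last_id))  # gap runs to the end of the window
    return ranges


def wildcard_urls(ranges):
    """Yield one topic-less link per missing message ID."""
    for start, end in ranges:
        for msg_id in range(start, end + 1):
            yield f"/index.php?topic=*.msg{msg_id}"
```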
legendary
Activity: 2030
Merit: 2174
Professional Community manager
I'm not exactly certain what the patch does even after reading the thread twice and don't know the problem it is trying to solve, but if it helps make the work of TryNinja and LoyceV easier then I am all down for it.

Good job yet again PowerGlove.
hero member
Activity: 510
Merit: 4005
This is something I've been working on since chatting with @TryNinja and @LoyceV about how they repair holes in their post archives after an outage. The way SMF combines topic IDs with message IDs makes it very awkward for scrapers to fill in gaps, because even though they know what ranges of message IDs they're missing, they often don't know what the corresponding topic IDs are.

Normally, the link format for a given (anchored) message looks like this: /index.php?topic={topic_id}.msg{message_id}#msg{message_id}

This patch changes SMF so that it will also accept an asterisk [1] as the topic ID: /index.php?topic=*.msg{message_id}#msg{message_id}
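For scraper authors, the two link shapes can be produced by one small helper. This is only a sketch: `message_url` and its parameters are made-up names, and only the URL layout comes from the formats quoted above.

```python
def message_url(message_id, topic_id=None, base="/index.php"):
    """Build an anchored message link.

    When topic_id is None, the '*' wildcard is used so the forum looks up
    the topic itself (this only works where the patch is installed).
    """
    if message_id <= 0:
        raise ValueError("message_id must be a positive integer")
    topic = "*" if topic_id is None else str(topic_id)
    return f"{base}?topic={topic}.msg{message_id}#msg{message_id}"
```

Calling `message_url(123)` gives the wildcard form, and `message_url(123, 45)` gives the normal form for topic 45.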

Here's the diff for @theymos:

Code:
--- baseline/Sources/QueryString.php 2011-02-07 16:45:09.000000000 +0000
+++ modified/Sources/QueryString.php 2023-08-23 12:36:25.000000000 +0000
@@ -65,41 +65,41 @@
  - makes sure a string only contains character which are allowed in
    XML/XHTML (not 0-8, 11, 12, and 14-31.)
  - tries to handle UTF-8 properly, and shouldn't negatively affect
    character sets like ISO-8859-1.
  - does not effect keys, only changes values.
  - may call itself recursively if necessary.
 
  string ob_sessrewrite(string buffer)
  - rewrites the URLs outputted to have the session ID, if the user
    is not accepting cookies and is using a standard web browser.
  - handles rewriting URLs for the queryless URLs option.
  - can be turned off entirely by setting $scripturl to an empty
    string, ''. (it wouldn't work well like that anyway.)
  - because of bugs in certain builds of PHP, does not function in
    versions lower than 4.3.0 - please upgrade if this hurts you.
 */
 
 // Clean the request variables - add html entities to GET and slashes if magic_quotes_gpc is Off.
 function cleanRequest()
 {
- global $board, $topic, $boardurl, $scripturl, $modSettings;
+ global $board, $topic, $boardurl, $scripturl, $modSettings, $db_prefix;
 
  // Makes it easier to refer to things this way.
  $scripturl = $boardurl . '/index.php';
 
  // Save some memory.. (since we don't use these anyway.)
  unset($GLOBALS['HTTP_POST_VARS'], $GLOBALS['HTTP_POST_VARS']);
  unset($GLOBALS['HTTP_POST_FILES'], $GLOBALS['HTTP_POST_FILES']);
 
  // These keys shouldn't be set...ever.
  if (isset($_REQUEST['GLOBALS']) || isset($_COOKIE['GLOBALS']))
  die('Invalid request variable.');
 
  // Same goes for numeric keys.
  foreach (array_merge(array_keys($_POST), array_keys($_GET), array_keys($_FILES)) as $key)
  if (is_numeric($key))
  die('Invalid request variable.');
 
  // Numeric keys in cookies are less of a problem. Just unset those.
  foreach ($_COOKIE as $key => $value)
  if (is_numeric($key))
@@ -214,40 +214,49 @@
  else
  $board = 0;
 
  // If there's a threadid, it's probably an old YaBB SE link.  Flow with it.
  if (isset($_REQUEST['threadid']) && !isset($_REQUEST['topic']))
  $_REQUEST['topic'] = $_REQUEST['threadid'];
 
  // We've got topic!
  if (isset($_REQUEST['topic']))
  {
  // Make sure that its a string and not something else like an array
  $_REQUEST['topic'] = (string)$_REQUEST['topic'];
 
  // Slash means old, beta style, formatting.  That's okay though, the link should still work.
  if (strpos($_REQUEST['topic'], '/') !== false)
  list ($_REQUEST['topic'], $_REQUEST['start']) = explode('/', $_REQUEST['topic']);
  // Dots are useful and fun ;).  This is ?topic=1.15.
  elseif (strpos($_REQUEST['topic'], '.') !== false)
  list ($_REQUEST['topic'], $_REQUEST['start']) = explode('.', $_REQUEST['topic']);
 
+ // If a message ID was given with a topic ID of '*', then search for (and use) the correct topic ID.
+ if($_REQUEST['topic'] == '*' && !empty($_REQUEST['start']) && substr($_REQUEST['start'], 0, 3) == 'msg')
+ {
+ $result = db_query('SELECT ID_TOPIC FROM ' . $db_prefix . 'messages WHERE ID_MSG = ' . (int)substr($_REQUEST['start'], 3), __FILE__, __LINE__);
+ $row = mysql_fetch_row($result);
+ mysql_free_result($result);
+ $_REQUEST['topic'] = !empty($row) ? (int)$row[0] : -1;
+ }
+
  $topic = (int) $_REQUEST['topic'];
 
  // Now make sure the online log gets the right number.
  $_GET['topic'] = $topic;
  }
  else
  $topic = 0;
 
  // There should be a $_REQUEST['start'], some at least.  If you need to default to other than 0, use $_GET['start'].
  if (empty($_REQUEST['start']) || $_REQUEST['start'] < 0 || (int) $_REQUEST['start'] > 2147473647)
  $_REQUEST['start'] = 0;
 
  // The action needs to be a string and not an array or anything else
  if (isset($_REQUEST['action']))
  $_REQUEST['action'] = (string) $_REQUEST['action'];
  if (isset($_GET['action']))
  $_GET['action'] = (string) $_GET['action'];
 
  // Store the REMOTE_ADDR for later - even though we HOPE to never use it...
  $_SERVER['BAN_CHECK_IP'] = isset($_SERVER['REMOTE_ADDR']) && preg_match('~^((([1]?\d)?\d|2[0-4]\d|25[0-5])\.){3}(([1]?\d)?\d|2[0-4]\d|25[0-5])$~', $_SERVER['REMOTE_ADDR']) === 1 ? $_SERVER['REMOTE_ADDR'] : 'unknown';

(I made a previous attempt at this, based on HTTP redirects that would've meant a maximum of 30 requests per minute because of the rate-limiter. I'm hoping that this new redirection-free attempt meets with theymos' approval and ends up simplifying LoyceV's and TryNinja's scrapers and making them more reliable.)

[1] An asterisk (*) makes sense to me, and is safe to use according to my reading of RFC 3986, but zero (0) and underscore (_) both make sense too, I guess. (It's easy for theymos to adjust.)
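For readers who don't read PHP, the branch added by the diff can be paraphrased in Python roughly as follows. This is an approximation and not the patch itself: the `topic_of_message` dict stands in for the `messages` table lookup, `php_int` only imitates PHP's `(int)` cast for plain digit strings, and the older slash-style links are ignored.

```python
import re


def php_int(text):
    """Roughly imitate PHP's (int) cast on a string: take a leading
    optionally signed run of digits, otherwise return 0."""
    match = re.match(r"\s*([+-]?\d+)", text)
    return int(match.group(1)) if match else 0


def resolve_topic(topic_param, topic_of_message):
    """Return the topic ID that SMF would end up with for ?topic=<topic_param>.

    topic_of_message maps message ID -> topic ID (a stand-in for the
    database query in the patch).
    """
    topic, start = topic_param, ""
    if "." in topic_param:
        # Like list(...) = explode('.', ...): keep the first two pieces.
        parts = topic_param.split(".")
        topic, start = parts[0], parts[1]

    # The patched branch: a '*' topic with a 'msg<ID>' start triggers a lookup.
    if topic == "*" and start and start[:3] == "msg":
        message_id = php_int(start[3:])
        topic = str(topic_of_message.get(message_id, -1))  # -1 when not found

    return php_int(topic)
```

Under these assumptions, `*.msg123` resolves to whichever topic holds message 123, an unknown message ID yields -1, and ordinary links such as `45.msg123` are unaffected.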