This is something I've been working on since chatting with @TryNinja and @LoyceV about how they repair holes in their post archives after an outage. The way SMF combines topic IDs with message IDs makes it very awkward for scrapers to fill in gaps, because even though they know what ranges of message IDs they're missing, they often don't know what the corresponding topic IDs are.
Normally, the link format for a given (anchored) message looks like this:
/index.php?topic={topic_id}.msg{message_id}#msg{message_id}
This patch changes SMF so that it will also accept an asterisk [1] as the topic ID:
/index.php?topic=*.msg{message_id}#msg{message_id}
Here's the diff for @theymos:
--- baseline/Sources/QueryString.php 2011-02-07 16:45:09.000000000 +0000
+++ modified/Sources/QueryString.php 2023-08-23 12:36:25.000000000 +0000
@@ -65,41 +65,41 @@
- makes sure a string only contains character which are allowed in
XML/XHTML (not 0-8, 11, 12, and 14-31.)
- tries to handle UTF-8 properly, and shouldn't negatively affect
character sets like ISO-8859-1.
- does not effect keys, only changes values.
- may call itself recursively if necessary.
string ob_sessrewrite(string buffer)
- rewrites the URLs outputted to have the session ID, if the user
is not accepting cookies and is using a standard web browser.
- handles rewriting URLs for the queryless URLs option.
- can be turned off entirely by setting $scripturl to an empty
string, ''. (it wouldn't work well like that anyway.)
- because of bugs in certain builds of PHP, does not function in
versions lower than 4.3.0 - please upgrade if this hurts you.
*/
// Clean the request variables - add html entities to GET and slashes if magic_quotes_gpc is Off.
function cleanRequest()
{
- global $board, $topic, $boardurl, $scripturl, $modSettings;
+ global $board, $topic, $boardurl, $scripturl, $modSettings, $db_prefix;
// Makes it easier to refer to things this way.
$scripturl = $boardurl . '/index.php';
// Save some memory.. (since we don't use these anyway.)
unset($GLOBALS['HTTP_POST_VARS'], $GLOBALS['HTTP_POST_VARS']);
unset($GLOBALS['HTTP_POST_FILES'], $GLOBALS['HTTP_POST_FILES']);
// These keys shouldn't be set...ever.
if (isset($_REQUEST['GLOBALS']) || isset($_COOKIE['GLOBALS']))
die('Invalid request variable.');
// Same goes for numeric keys.
foreach (array_merge(array_keys($_POST), array_keys($_GET), array_keys($_FILES)) as $key)
if (is_numeric($key))
die('Invalid request variable.');
// Numeric keys in cookies are less of a problem. Just unset those.
foreach ($_COOKIE as $key => $value)
if (is_numeric($key))
@@ -214,40 +214,49 @@
else
$board = 0;
// If there's a threadid, it's probably an old YaBB SE link. Flow with it.
if (isset($_REQUEST['threadid']) && !isset($_REQUEST['topic']))
$_REQUEST['topic'] = $_REQUEST['threadid'];
// We've got topic!
if (isset($_REQUEST['topic']))
{
// Make sure that its a string and not something else like an array
$_REQUEST['topic'] = (string)$_REQUEST['topic'];
// Slash means old, beta style, formatting. That's okay though, the link should still work.
if (strpos($_REQUEST['topic'], '/') !== false)
list ($_REQUEST['topic'], $_REQUEST['start']) = explode('/', $_REQUEST['topic']);
// Dots are useful and fun ;). This is ?topic=1.15.
elseif (strpos($_REQUEST['topic'], '.') !== false)
list ($_REQUEST['topic'], $_REQUEST['start']) = explode('.', $_REQUEST['topic']);
+ // If a message ID was given with a topic ID of '*', then search for (and use) the correct topic ID.
+ if($_REQUEST['topic'] == '*' && !empty($_REQUEST['start']) && substr($_REQUEST['start'], 0, 3) == 'msg')
+ {
+ $result = db_query('SELECT ID_TOPIC FROM ' . $db_prefix . 'messages WHERE ID_MSG = ' . (int)substr($_REQUEST['start'], 3), __FILE__, __LINE__);
+ $row = mysql_fetch_row($result);
+ mysql_free_result($result);
+ $_REQUEST['topic'] = !empty($row) ? (int)$row[0] : -1;
+ }
+
$topic = (int) $_REQUEST['topic'];
// Now make sure the online log gets the right number.
$_GET['topic'] = $topic;
}
else
$topic = 0;
// There should be a $_REQUEST['start'], some at least. If you need to default to other than 0, use $_GET['start'].
if (empty($_REQUEST['start']) || $_REQUEST['start'] < 0 || (int) $_REQUEST['start'] > 2147473647)
$_REQUEST['start'] = 0;
// The action needs to be a string and not an array or anything else
if (isset($_REQUEST['action']))
$_REQUEST['action'] = (string) $_REQUEST['action'];
if (isset($_GET['action']))
$_GET['action'] = (string) $_GET['action'];
// Store the REMOTE_ADDR for later - even though we HOPE to never use it...
$_SERVER['BAN_CHECK_IP'] = isset($_SERVER['REMOTE_ADDR']) && preg_match('~^((([1]?\d)?\d|2[0-4]\d|25[0-5])\.){3}(([1]?\d)?\d|2[0-4]\d|25[0-5])$~', $_SERVER['REMOTE_ADDR']) === 1 ? $_SERVER['REMOTE_ADDR'] : 'unknown';
(I made a previous attempt at this, based on HTTP redirects, which would've meant a maximum of 30 requests per minute because of the rate-limiter. I'm hoping that this new redirection-free attempt meets with theymos' approval and ends up simplifying LoyceV's and TryNinja's scrapers and making them more reliable.)
[1] An asterisk (*) makes sense to me, and is safe to use according to my reading of RFC 3986, but zero (0) and underscore (_) both make sense too, I guess. (It's easy for theymos to adjust.)
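For anyone writing a scraper against this, here's a rough sketch of how the wildcard link format could be used to fill a gap. It's only an illustration: the base URL, the function names and the (first, last) gap representation are my own assumptions, not anything taken from LoyceV's or TryNinja's actual code, and it only builds the URLs (fetching and rate-limiting are left out).

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical forum root; substitute the real one.
BASE_URL = "https://forum.example"


def wildcard_url(message_id: int, base_url: str = BASE_URL) -> str:
    """Build a link to a message whose topic ID is unknown.

    Uses '*' in place of the topic ID, as accepted by the patch above.
    """
    if message_id <= 0:
        raise ValueError("message_id must be a positive integer")
    return f"{base_url}/index.php?topic=*.msg{message_id}#msg{message_id}"


def gap_urls(
    gaps: Iterable[Tuple[int, int]], base_url: str = BASE_URL
) -> Iterator[str]:
    """Yield one wildcard URL per missing message ID.

    Each gap is an inclusive (first_id, last_id) pair (an assumed
    representation of "ranges of message IDs they're missing").
    """
    for first_id, last_id in gaps:
        for message_id in range(first_id, last_id + 1):
            yield wildcard_url(message_id, base_url)


if __name__ == "__main__":
    # Example: two small gaps in an archive.
    for url in gap_urls([(1000, 1002), (2500, 2500)]):
        print(url)
```

Note that the #msg{message_id} part is a fragment, which HTTP clients don't send to the server, so a scraper could drop it and only request /index.php?topic=*.msg{message_id}.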