Topic: [solved] @theymos: get an exception from Cloudflare for scrapers? (Read 242 times)

copper member
Activity: 1526
Merit: 2890
My scraper got whitelisted! Thanks theymos :)
Whitelisting is, as theymos called it, a "janky" solution. Downloading All for instance still gives:
Code:
ERROR 503: Service Temporarily Unavailable.
I don't need "All" so that's okay. Let's see if this works next time Cloudflare kills bots again.

Ninjastic.space is whitelisted too.

Wow congratulations you guys deserve it.

Just wondering, how does it work? I mean, I use my script on and off for some personal but very selective scraping (of course, I can wait a day or two to let Cloudflare cool down), but I still want to know: is there a chance to request it?

I think theymos didn’t make it public and this was done internally.

LoyceV and TryNinja enjoy the special privileges… if bitcointalk goes down I know where to go next :)
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Charging for access is common when a project wants to access a large amount of data. Commercial or otherwise, scrapers take up more forum resources than the typical user, likely by a lot.
Probably. But (at least the known) scrapers give something back to the community.

the forum's mission to be as free as possible.
I'd say scraping is part of that. The "1 page per second" rule still applies after whitelisting.
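For anyone writing their own scraper, the "1 page per second" rule can be enforced on the client side with a small throttle. This is only a minimal sketch: the class name `PageThrottle` and its parameters are made up for illustration, and the injectable `clock` and `sleep` arguments exist just to make it testable.

```python
import time


class PageThrottle:
    """Hypothetical helper: keep at least `min_interval` seconds between page requests."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the throttle can be tested without waiting
        self._sleep = sleep
        self._last = None      # time at which the previous request was allowed

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            # Sleep only for the part of the interval that has not yet elapsed.
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Call `throttle.wait()` right before every page download; the first call returns immediately and later calls sleep only as long as needed.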
copper member
Activity: 1610
Merit: 1898
Amazon Prime Member #7
Allowing scrapers has already made all forum posts and other data freely searchable, so unless this changes, I see no reason to implement an API for people like loyce. It would probably make sense to charge for said access.
New access structure for loyce.club: Unedited post: $2. Viewing a Trust list: $2. Notifications: $0.50 each. Somehow charging fees for non-commercial community projects doesn't seem right.
Charging for access is common when a project wants to access a large amount of data. Commercial or otherwise, scrapers take up more forum resources than the typical user, likely by a lot. Scrapers also have zero chance of even looking at any ads that are served to forum users, which negatively affects forum revenue.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
In this case it may not have mattered. Cloudflare was having a real bad morning today. Got calls from a few clients who could not get to their own sites from their own offices, which are in the IP whitelist.

Having your IP(s) in it can't hurt, but when Cloudflare themselves are having problems I don't think it's going to matter.
If it's a large problem on Cloudflare's side, I would have expected them to fix it in much less than 8.5 hours. Considering how widely used they are, that's terrible.

I think a more reliable solution would be to have purpose-built scraping endpoints that take an API key. The endpoints could be defined in such a way to allow for checksum-based data reconciliation. That way, whenever there are outages (Cloudflare or otherwise) the missing data could be correctly retrieved after the fact, and there would never be any doubt as to whether you're one-to-one with the master data.
Currently, some of the "master data" doesn't exist: if a user edits their post within 10 minutes, the original is lost from the forum.

Allowing scrapers has already made all forum posts and other data freely searchable, so unless this changes, I see no reason to implement an API for people like loyce. It would probably make sense to charge for said access.
New access structure for loyce.club: Unedited post: $2. Viewing a Trust list: $2. Notifications: $0.50 each. Somehow charging fees for non-commercial community projects doesn't seem right.



My scraper got whitelisted! Thanks theymos :)
Whitelisting is, as theymos called it, a "janky" solution. Downloading All for instance still gives:
Code:
ERROR 503: Service Temporarily Unavailable.
I don't need "All" so that's okay. Let's see if this works next time Cloudflare kills bots again.

Ninjastic.space is whitelisted too.
legendary
Activity: 3500
Merit: 6205
Looking for campaign manager? Contact icopress!
Let's get real, especially these 3 scrapers help the forum a lot, filling the gaps with features we don't have.. for example because the forum software is old.
I don't disagree with you about these three users. There is another one, or better to say two more, we can add: Ddmr and tranthidung, as far as I know. But in this process we are trusting these people. People change, and they will change too. Any bad intent could create trouble for others. In the meantime you have no choice either.

Yes, I've noticed soon enough that I've missed at least DdmrDdmr, but for the sake of example I was fine, hence I didn't edit the post.
I also wrote there about whitelisting. Whitelisting can go in both directions (it can be disabled or even turned into blacklisting). When somebody who got his scraper whitelisted is no longer trusted (and we know how easy it is for that to happen), his whitelisting will no longer be granted.

PS. Of course, API keys would be more professional, but I don't dare to even dream about that.
legendary
Activity: 2702
Merit: 2645
Farewell LEO: o_e_l_e_o
Let's get real, especially these 3 scrapers help the forum a lot, filling the gaps with features we don't have.. for example because the forum software is old.
I don't disagree with you about these three users. There is another one, or better to say two more, we can add: Ddmr and tranthidung, as far as I know. But in this process we are trusting these people. People change, and they will change too. Any bad intent could create trouble for others. In the meantime you have no choice either.


I think a more reliable solution would be to have purpose-built scraping endpoints that take an API key.[...]
Off-topic: I can imagine a set of new endpoints being used to power a Bitcointalk mobile app. Also, I can imagine the same being used as a way to bootstrap the new forum software (as opposed to a point-in-time data migration) which might allow for the old forum and the new one to happily co-exist for a time.
We can only dream of it if we expect theymos to do something about it. But since you have a good hand, you could try it as a patch, like the OP, and hand the code over to theymos.

Is it a bad thing to give an advantage to users who create something that some people consider useful?
A knife can be used to kill someone or to prepare your veg. It's all down to the intention of use of a tool.
copper member
Activity: 1610
Merit: 1898
Amazon Prime Member #7
Agree with PowerGlove. It would be superior to allow people to access posts (and other forum data) via an API endpoint. This would probably actually reduce the number of page views, because a "scraper" could gather posts in batches rather than in real time.

Allowing scrapers has already made all forum posts and other data freely searchable, so unless this changes, I see no reason to implement an API for people like loyce. It would probably make sense to charge for said access.
legendary
Activity: 1568
Merit: 6660
bitcoincleanup.com / bitmixlist.org
Lynx browsing also doesn't work, Cloudflare is gating me at the "checking your connection" screen.

Why the sudden increase in Cloudflare challenges anyway? There has been no DDoS attack recently.
hero member
Activity: 510
Merit: 3981
I think a more reliable solution would be to have purpose-built scraping endpoints that take an API key. The endpoints could be defined in such a way to allow for checksum-based data reconciliation. That way, whenever there are outages (Cloudflare or otherwise) the missing data could be correctly retrieved after the fact, and there would never be any doubt as to whether you're one-to-one with the master data. It wouldn't be easy to pull off, but I'm pretty confident I could safely build something like this (SMF programming is truly painful, but now that I've spent the effort learning it, it would be gratifying to put those skills to good use).
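To make the checksum-based reconciliation idea a bit more concrete, here is a rough sketch. Everything in it is an assumption for illustration: no such endpoints exist, the function names are invented, and the `master` dict merely stands in for data the forum would expose (in practice only the range checksums would be fetched, authenticated with the API key). The idea is to compare one checksum per range of post IDs and split a range only when the checksums differ.

```python
import hashlib


def post_digest(post_id: int, body: str) -> str:
    """Digest of a single post (hypothetical format: '<id>:<body>')."""
    return hashlib.sha256(f"{post_id}:{body}".encode("utf-8")).hexdigest()


def range_checksum(posts: dict, lo: int, hi: int) -> str:
    """Checksum over all posts whose ID lies in [lo, hi], in ID order."""
    h = hashlib.sha256()
    for pid in sorted(p for p in posts if lo <= p <= hi):
        h.update(post_digest(pid, posts[pid]).encode("ascii"))
    return h.hexdigest()


def find_mismatches(master: dict, mirror: dict, lo: int, hi: int, leaf_size: int = 4) -> list:
    """Return the post IDs in [lo, hi] where the mirror differs from the master.

    Ranges with equal checksums are skipped entirely, so a mirror that is
    almost in sync needs only a handful of comparisons.
    """
    if range_checksum(master, lo, hi) == range_checksum(mirror, lo, hi):
        return []
    if hi - lo + 1 <= leaf_size:
        # Small enough: compare post by post (missing posts compare as None).
        return [pid for pid in range(lo, hi + 1) if master.get(pid) != mirror.get(pid)]
    mid = (lo + hi) // 2
    return (find_mismatches(master, mirror, lo, mid, leaf_size)
            + find_mismatches(master, mirror, mid + 1, hi, leaf_size))
```

After an outage, a scraper would run something like `find_mismatches` over the affected ID range and re-fetch only the posts it reports, which covers both missing and changed posts.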

Off-topic: I can imagine a set of new endpoints being used to power a Bitcointalk mobile app. Also, I can imagine the same being used as a way to bootstrap the new forum software (as opposed to a point-in-time data migration) which might allow for the old forum and the new one to happily co-exist for a time.
hero member
Activity: 1428
Merit: 836
Top Crypto Casino
Have you tried to recreate your bot maybe?
Recreate? It will be a waste of time and resources.
What would be the difference if the main problem is in Cloudflare itself? Also, there have been lots of cases of Cloudflare causing trouble for other sites these past few days. So better be patient and wait till they fix it, unless theymos tweaks something in the config.
newbie
Activity: 1
Merit: 0
Have you tried to recreate your bot maybe?
legendary
Activity: 3458
Merit: 6231
Crypto Swap Exchange
In this case it may not have mattered. Cloudflare was having a real bad morning today. Got calls from a few clients who could not get to their own sites from their own offices, which are in the IP whitelist.

Having your IP(s) in it can't hurt, but when Cloudflare themselves are having problems I don't think it's going to matter.

And yeah, you have to figure there are a lot of people doing the same thing they are just not talking about it.

-Dave
legendary
Activity: 3500
Merit: 6205
Looking for campaign manager? Contact icopress!
Maybe I can trust Loyce.club, and the same goes for Ninjastic or BPIP, but if we give them an advantage over others then we are creating a double standard. Especially when this forum is about an economy like Bitcoin, we really need to be careful about what information we post and who keeps copies of our information.

You do have a point about the double standard, but I guess that it's not so difficult to make a page explaining the situation and everybody who wants to have a scraper can get his IP whitelisted, for example, as long as it's an okay user of the forum.
Let's get real, especially these 3 scrapers help the forum a lot, filling the gaps with features we don't have.. for example because the forum software is old.

I did miss the notifications today, so maybe I am somewhat biased, but I think that it would be nice to not have many problems like this.
However, I have a feeling that it was a bug/mishap, not an intentional blockade, so maybe everything can go on as usual.
Of course, I understand OP's frustration. He will probably need to do extra work because of all this so he can feed his tool correctly with the missing data... so let's not be too harsh on him either. It's okay to ask; whether it will be granted or not... that's a different story and I trust theymos that he will see this topic and take the right decision.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Cloudflare or any other DDOS protection was supposed to be introduced to prevent scraping.
Scraping isn't the same as DDOS, and an exception for a few IP-addresses won't help any DDOS.

Quote
I am sorry if it sounds harsh, but I don't think many people on the internet like the way scrapers are scraping every piece of information that is published, intentionally or unintentionally. Everyone has the right to remove any information they want. But it's unknown how many bots are out there scraping everything without our knowledge.
Once you post something online, you should consider it compromised forever. Only a few scrapers publicly announce what they're doing, but you can bet there are others who quietly store everything.

Quote
Maybe I can trust Loyce.club, and the same goes for Ninjastic or BPIP, but if we give them an advantage over others then we are creating a double standard.
Is it a bad thing to give an advantage to users who create something that some people consider useful?
legendary
Activity: 2702
Merit: 2645
Farewell LEO: o_e_l_e_o
Public request to theymos: Is it possible to get an exception from Cloudflare for popular scraping bots? I'm thinking of BPIP.org, Ninjastic.space and loyce.club, but there are probably a few others that provide useful features for the Bitcointalk community. It would be really nice if DDOS-protection doesn't interfere with features.
Cloudflare or any other DDOS protection was supposed to be introduced to prevent scraping. I am sorry if it sounds harsh, but I don't think many people on the internet like the way scrapers are scraping every piece of information that is published, intentionally or unintentionally. Everyone has the right to remove any information they want. But it's unknown how many bots are out there scraping everything without our knowledge.

Maybe I can trust Loyce.club, and the same goes for Ninjastic or BPIP, but if we give them an advantage over others then we are creating a double standard. Especially when this forum is about an economy like Bitcoin, we really need to be careful about what information we post and who keeps copies of our information.
legendary
Activity: 3290
Merit: 16489
Thick-Skinned Gang Leader and Golden Feather 2021
Today, Cloudflare was blocking scraping bots again. For 8 hours and 24 minutes, my scraper wasn't working. Anything between posts 61513760 and 61515614 (1853 posts) is missing, and amongst others my notification bot wasn't working. I already delayed my Merit and Trust data updates (which normally start on Friday and Saturday morning).
This happened 18 days ago too.

Public request to theymos: Is it possible to get an exception from Cloudflare for popular scraping bots? I'm thinking of BPIP.org, Ninjastic.space and loyce.club, but there are probably a few others that provide useful features for the Bitcointalk community. It would be really nice if DDOS-protection doesn't interfere with features.
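For reference, the size of a gap like the one above can be worked out from the post IDs alone. The sketch below is only an illustration: the function name is made up, and it assumes every ID between two scraped posts belongs to a post that should have been captured. "Between" is taken as exclusive, which gives 61515614 - 61513760 - 1 = 1853.

```python
def missing_ranges(seen_ids):
    """Return inclusive (first_missing, last_missing) ID ranges between scraped posts.

    Assumption: IDs are consecutive, so any jump between two scraped IDs is a gap.
    """
    ids = sorted(set(seen_ids))
    gaps = []
    for a, b in zip(ids, ids[1:]):
        if b - a > 1:
            gaps.append((a + 1, b - 1))
    return gaps


# The last post before and the first post after the outage, as reported above.
gaps = missing_ranges([61513759, 61513760, 61515614, 61515615])
first, last = gaps[0]
print(first, last, last - first + 1)  # 61513761 61515613 1853
```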