2025-07-01 · 5 min read
Real-Life Example: Using Proxies Effectively to Avoid Scraping Bans
Learn how to use proxies to bypass anti-bot measures and avoid detection while web scraping. This in-depth guide covers effective strategies and real-world examples for long-term, successful data extraction.
Aproxy Team


Web scraping has become an essential tool for businesses and developers looking to extract valuable data from websites. However, modern websites employ sophisticated anti-bot measures that can quickly detect and block scraping activities. Using proxies effectively is crucial for avoiding bans and maintaining successful long-term scraping operations. This comprehensive guide explores proven strategies and real-world examples of how to use proxies to avoid detection while scraping.

Understanding Why Websites Block Scrapers

Websites implement blocking mechanisms for several critical reasons. They need to prevent excessive load on their servers, protect copyrighted content and intellectual property, avoid loss of ad revenue from scraped content, maintain competitive advantage from their data, and ensure a good user experience for legitimate human visitors. According to recent studies, bad bots (including scrapers) accounted for 25.6% of all website traffic in 2021, representing a 10.4% increase from the previous year.
Anti-bot systems detect scraping activities by monitoring for suspicious patterns such as high request rates from single IP addresses, unusual access patterns like rapidly accessing many pages, invalid or missing user agent strings, and requests originating from data center IP addresses commonly used by scrapers. When these red flags are detected, websites may block IP addresses, present CAPTCHAs, or implement other countermeasures.

The Foundation: Understanding Proxy Rotation

Proxy rotation is the cornerstone of effective web scraping without getting blocked. This technique involves changing the IP address used for requests at regular intervals, either after a specific number of requests or at predetermined time intervals. The fundamental principle is to make each request appear to come from a different machine or location, making it extremely difficult for anti-bot measures to detect and block scraping activities.
Proxy rotation serves multiple critical functions in web scraping operations. It helps avoid IP bans and rate limits by distributing requests across multiple IP addresses, preventing any single IP from triggering security mechanisms. It reduces the risk of detection by mimicking the behavior of multiple users accessing a site from different locations, making traffic appear more organic. Additionally, it enables access to geographical content by using proxies from different regions, and allows for concurrent requests without overwhelming the target website from a single IP address.

Real-Life Case Study: The $3,600 Proxy Disaster

A compelling real-world example demonstrates both the power and potential pitfalls of proxy usage in web scraping. A data extraction company attempted to scrape Stone Island's e-commerce website, which was protected by Akamai's advanced anti-bot system. Initially, the company used residential proxies with a 99.3% success rate for other websites, but Stone Island presented unique challenges.

The Initial Approach and Failure

The scraper kept getting redirected to JavaScript challenges before throwing errors for maximum redirections reached. The team first tried using Nimble Browser, an AI-powered browser solution, but quickly exhausted their 100,000 request credits without completing a single successful run.

The Expensive Mistake

The team then switched to using Playwright with residential proxies, implementing what seemed like a reasonable approach. However, they made several critical errors that led to catastrophic costs:
  1. Shared credentials across multiple websites: They used the same proxy credentials for four different websites, making it impossible to track individual website usage and set appropriate thresholds.
  2. Inadequate resource blocking: While they blocked images to save bandwidth, they failed to block third-party resources. For every GB spent on the actual target website, they consumed at least 5 GB on third-party services they didn't need.
  3. No usage thresholds: They didn't implement safeguards to prevent runaway proxy usage, which proved costly when the website exhibited unexpected behavior.
The Root Cause

Analysis revealed that Stone Island's website was loading third-party resources extensively, and some pages entered endless loading loops when accessed from data center environments. This combination resulted in massive proxy bandwidth consumption without successful data extraction.

Proven Strategies for Effective Proxy Usage

1. Implement Smart Proxy Rotation

import requests
import random
from itertools import cycle

# Sequential rotation
def sequential_proxy_rotation():
    proxies_list = open("proxies_list.txt").read().strip().split("\n")
    proxy_pool = cycle(proxies_list)

    for _ in range(4):
        proxy = next(proxy_pool)
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        try:
            response = requests.get("https://httpbin.io/ip", proxies=proxies, timeout=10)
            print(f"Success with proxy: {proxy}" if response.status_code == 200 else f"Failed with status: {response.status_code}")
        except Exception as e:
            print(f"Error with proxy {proxy}: {e}")

# Random rotation
def random_proxy_rotation():
    proxies_list = open("proxies_list.txt").read().strip().split("\n")

    for _ in range(4):
        random_proxy = random.choice(proxies_list)
        proxies = {"http": f"http://{random_proxy}", "https": f"http://{random_proxy}"}
        try:
            response = requests.get("https://httpbin.io/ip", proxies=proxies, timeout=10)
            print(f"Success with proxy: {random_proxy}")
        except Exception as e:
            print(f"Error with proxy {random_proxy}: {e}")

2. Control Request Rate and Timing

import time
import random
import requests

def make_request_with_delay(url, proxies):
    # Wait a random interval between requests to mimic human browsing
    delay = random.uniform(10, 15)
    time.sleep(delay)
    try:
        response = requests.get(url, proxies=proxies, timeout=10)
        return response
    except Exception as e:
        print(f"Request failed: {e}")
        return None

3. Implement User-Agent Rotation

import random
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0...)",
    "Mozilla/5.0 (Macintosh; Intel...)",
    "Mozilla/5.0 (Linux; Android...)"
]

def get_random_user_agent():
    return random.choice(user_agents)

def make_request_with_rotation(url, proxies):
    headers = {'User-Agent': get_random_user_agent()}
    return requests.get(url, proxies=proxies, headers=headers)

4. Choose the Right Proxy Type

Residential proxies route through real consumer devices, which makes them far harder to flag than datacenter IPs: use them for stealth on well-defended targets. Datacenter proxies are faster and cheaper: use them for speed on less protected sites. Mixing both reduces detection risk by 21%.
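
As an illustration, a scraper might keep two pools and choose based on how aggressively the target blocks. The pool contents and the is_high_security flag below are hypothetical placeholders, not real endpoints:

import random

# Hypothetical pools; fill with your own provider's endpoints
residential_proxies = ["res-proxy1:8080", "res-proxy2:8080"]
datacenter_proxies = ["dc-proxy1:8080", "dc-proxy2:8080"]

def pick_proxy(is_high_security):
    # Stealth (residential) for hardened sites, speed (datacenter) otherwise
    pool = residential_proxies if is_high_security else datacenter_proxies
    return random.choice(pool)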

5. Implement Session Management

import random
import requests

class ProxySession:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
        self.sessions = {}

    def get_session(self, session_id):
        if session_id not in self.sessions:
            proxy = random.choice(self.proxy_list)
            session = requests.Session()
            session.proxies = {'http': f'http://{proxy}', 'https': f'http://{proxy}'}
            self.sessions[session_id] = session
        return self.sessions[session_id]
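
For example, pinning each logical user to one session keeps cookies and the exit IP stable across a multi-page flow. The session id and proxy endpoints here are placeholders:

manager = ProxySession(["proxy1:8080", "proxy2:8080"])
session = manager.get_session("user-42")
# Every request for "user-42" now reuses the same proxy and cookie jar
response = session.get("https://httpbin.io/ip", timeout=10)
print(response.text)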

Best Practices for Avoiding Proxy Bans

Monitor and Manage Proxy Health

import requests

def monitor_proxy_health(proxy_list):
    healthy_proxies = []
    for proxy in proxy_list:
        proxies = {'http': f'http://{proxy}', 'https': f'http://{proxy}'}
        try:
            response = requests.get('https://httpbin.io/ip', proxies=proxies, timeout=5)
            if response.status_code == 200:
                healthy_proxies.append(proxy)
                print(f"Proxy {proxy} is healthy")
            else:
                print(f"Proxy {proxy} returned status {response.status_code}")
        except Exception as e:
            print(f"Proxy {proxy} failed: {e}")
    return healthy_proxies

Set Usage Thresholds and Alerts

class ProxyUsageMonitor:
    def __init__(self, max_requests=1000, max_bandwidth_gb=1.0):
        self.max_requests = max_requests
        self.max_bandwidth_gb = max_bandwidth_gb
        self.request_count = 0
        self.bandwidth_used = 0.0

    def check_limits(self):
        if self.request_count >= self.max_requests:
            raise Exception(f"Request limit exceeded: {self.request_count}")
        if self.bandwidth_used >= self.max_bandwidth_gb:
            raise Exception(f"Bandwidth limit exceeded: {self.bandwidth_used} GB")

    def log_request(self, response_size_bytes):
        self.request_count += 1
        self.bandwidth_used += response_size_bytes / (1024**3)
        self.check_limits()
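
A hypothetical way to wire the monitor into a request loop; the limits and proxy endpoint are placeholders:

import requests

monitor = ProxyUsageMonitor(max_requests=500, max_bandwidth_gb=0.5)
proxies = {"http": "http://proxy1:8080", "https": "http://proxy1:8080"}
response = requests.get("https://httpbin.io/ip", proxies=proxies, timeout=10)
monitor.log_request(len(response.content))  # raises once a limit is crossed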

Use Separate Credentials

Separate credentials by site to track and manage resource usage precisely.
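
A minimal sketch of that idea, assuming your provider issues multiple credential pairs; the usernames, passwords, and gateway host below are placeholders:

# One credential pair per target site, so each site's bandwidth
# shows up separately in your provider's dashboard
PROXY_CREDENTIALS = {
    "site-a.com": "userA:passA@gate.example-proxy.com:7777",
    "site-b.com": "userB:passB@gate.example-proxy.com:7777",
}

def proxies_for(site):
    endpoint = PROXY_CREDENTIALS[site]
    return {"http": f"http://{endpoint}", "https": f"http://{endpoint}"}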

Block Unnecessary Resources

def setup_resource_blocking(page):
    def route_intercept(route):
        url = route.request.url
        resource_type = route.request.resource_type
        # Skip heavy assets that are not needed for data extraction
        if resource_type in ["image", "stylesheet"]:
            return route.abort()
        # Block third-party requests that waste proxy bandwidth
        if "target-domain.com" not in url:
            return route.abort()
        return route.continue_()

    page.route("**/*", route_intercept)
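
To put the blocker to work, attach it to a page before navigating. Here is a sketch using Playwright's sync API; the proxy server address and target URL are placeholders:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(proxy={"server": "http://proxy-host:8080"})
    page = browser.new_page()
    setup_resource_blocking(page)  # must run before the first navigation
    page.goto("https://target-domain.com/products")
    print(page.title())
    browser.close()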

Real-World Implementation: E-commerce Scraping

Even residential proxies failed against Amazon's anti-bot defenses; only dedicated unblocker services achieved complete data extraction. This demonstrates the limits of proxies alone when facing high-security websites.

Advanced Techniques

  • Use machine learning to adapt scraper behavior (a simple heuristic version is sketched after this list)
  • Integrate CAPTCHA-solving services
  • Distribute scraping tasks across regions and machines
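
A full machine-learning setup is beyond the scope of this post, but even a simple feedback heuristic captures the idea: slow down when block signals appear. The status codes tracked and the scaling factor below are illustrative assumptions, not production tuning:

import random
import time

def adaptive_delay(recent_status_codes, base_delay=5.0):
    # Count recent block signals (403 Forbidden, 429 Too Many Requests)
    blocked = sum(1 for status in recent_status_codes if status in (403, 429))
    # Hypothetical scaling: each block signal widens the delay window
    multiplier = 1 + 2 * blocked
    time.sleep(random.uniform(base_delay, base_delay * multiplier))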

Conclusion

Proxies are essential but must be used wisely. Key takeaways:

  • Combine proxy rotation with user-agent rotation and randomized timing
  • Block third-party resources
  • Set usage limits
  • Track proxy health continuously
  • Learn from failures and optimize over time