Building a Scalable Web Scraper with Playwright and Node.js

November 7, 2024

Web scraping is one of the most practical applications of Node.js automation. Whether you're aggregating real estate data, monitoring competitor pricing, or building datasets for analysis, a robust scraper can save hundreds of hours of manual work.

In this Node.js tutorial, we'll build a production-ready apartment listing scraper that demonstrates:

  • Playwright automation for JavaScript-heavy sites
  • CLI architecture with multiple scraper modes
  • Parallel execution for faster data collection
  • Anti-scraping countermeasures and timeout tuning
  • Data filtering and CSV export for analysis
  • Ethical considerations and rate limiting

This isn't a toy example—it's a real scraper I built to aggregate apartment data from Apartments.com for market analysis. The techniques apply to any complex, JavaScript-driven website.

Why Playwright Over Puppeteer or Cheerio?

When building web scrapers in Node.js, you have several options:

| Tool       | Use Case                  | Pros                      | Cons                    |
|------------|---------------------------|---------------------------|-------------------------|
| Cheerio    | Static HTML parsing       | Fast, lightweight         | No JavaScript execution |
| Puppeteer  | Chrome automation         | Mature, good docs         | Chrome-only             |
| Playwright | Cross-browser automation  | Multi-browser, better API | Slightly heavier        |

For modern sites like Apartments.com that rely heavily on JavaScript for pagination and dynamic content, Playwright is the clear winner (a short sketch of the last two points follows the list):

  • ✅ Handles JavaScript-driven pagination
  • ✅ Cross-browser support (Chromium, Firefox, WebKit)
  • ✅ Better timeout handling and retry logic
  • ✅ Auto-waiting for elements (no more setTimeout hacks)
  • ✅ Built-in network interception
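
The last two points deserve a quick illustration. Below is a minimal sketch (the target URL and selector are placeholders, not real Apartments.com markup) showing auto-waiting and network interception together:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  // Network interception: block heavy image requests to speed up crawling
  await page.route('**/*.{png,jpg,jpeg,webp}', (route) => route.abort());

  await page.goto('https://example.com');

  // Auto-waiting: click() waits for the element to be attached, visible,
  // and enabled before clicking, so no setTimeout hacks are needed
  await page.click('button.load-more'); // placeholder selector

  await browser.close();
})();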

Project Setup

Initialize Node.js Project

mkdir apartments-scraper && cd apartments-scraper
npm init -y

# Install dependencies
npm install playwright csv-writer yargs dotenv
npm install --save-dev eslint prettier

Project Structure

apartments-scraper/
├── package.json
├── .env
├── cli.js                       # Entry point
├── src/
│   ├── scrapers/
│   │   ├── list-scraper.js      # Crawls listing pages
│   │   ├── detail-scraper.js    # Extracts apartment details
│   │   └── base-scraper.js      # Shared scraper logic
│   ├── filters/
│   │   └── apartment-filter.js  # Filter logic
│   ├── utils/
│   │   ├── csv-exporter.js      # CSV export
│   │   └── logger.js            # Logging utility
│   └── config.js                # Configuration
├── data/
│   └── apartments.csv           # Output
└── README.md
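
The scrapers that follow read their settings from src/config.js, which the tree lists but the post doesn't show. Here is a minimal sketch covering only the fields referenced later (browser.headless, browser.slowMo, browser.timeout, scraper.requestDelay, scraper.maxConcurrent); the environment variable names and default values are assumptions:

// src/config.js
require('dotenv').config();

module.exports = {
  browser: {
    headless: process.env.HEADLESS !== 'false', // run headless unless HEADLESS=false
    slowMo: Number(process.env.SLOW_MO || 0), // ms pause between Playwright actions
    timeout: Number(process.env.PAGE_TIMEOUT || 30000), // default page timeout in ms
  },
  scraper: {
    requestDelay: Number(process.env.REQUEST_DELAY || 1000), // ms between pages/batches
    maxConcurrent: Number(process.env.MAX_CONCURRENT || 5), // detail pages per batch
  },
};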

Base Scraper: Shared Playwright Logic

First, let's create a base class with common Playwright setup:

// src/scrapers/base-scraper.js
const { chromium } = require('playwright');
const config = require('../config');

class BaseScraper {
  constructor() {
    this.browser = null;
    this.context = null;
  }

  async initialize() {
    console.log('🚀 Launching browser...');

    this.browser = await chromium.launch({
      headless: config.browser.headless,
      slowMo: config.browser.slowMo,
    });

    // Create context with realistic user agent
    this.context = await this.browser.newContext({
      userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
      viewport: { width: 1920, height: 1080 },
    });

    console.log('✅ Browser ready');
  }

  async createPage() {
    if (!this.context) {
      await this.initialize();
    }

    const page = await this.context.newPage();

    // Set default timeout
    page.setDefaultTimeout(config.browser.timeout);

    // Handle console messages for debugging
    page.on('console', (msg) => {
      if (msg.type() === 'error') {
        console.error(`[Browser Error] ${msg.text()}`);
      }
    });

    return page;
  }

  async close() {
    if (this.browser) {
      await this.browser.close();
      console.log('🔒 Browser closed');
    }
  }

  async delay(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

module.exports = BaseScraper;
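
The introduction promised retry logic, which the base class above doesn't include. A small helper like the following (a sketch; the name withRetry, its location, and the backoff values are assumptions) can wrap flaky operations such as page.goto:

// Hypothetical helper (could live in src/utils/retry.js): retry an async
// operation a few times with a linearly increasing backoff
async function withRetry(operation, retries = 3, backoffMs = 2000) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt === retries) throw error;
      console.warn(`⚠️  Attempt ${attempt} failed (${error.message}), retrying...`);
      await new Promise((resolve) => setTimeout(resolve, backoffMs * attempt));
    }
  }
}

module.exports = withRetry;

// Usage inside a scraper method:
// await withRetry(() => page.goto(url, { waitUntil: 'domcontentloaded' }));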

List Scraper: Crawling Pagination

The list scraper navigates through paginated results to collect all apartment URLs:

// src/scrapers/list-scraper.js
const BaseScraper = require('./base-scraper');
const config = require('../config');

class ListScraper extends BaseScraper {
  async scrapeListings(baseUrl, maxPages = null) {
    const page = await this.createPage();
    const apartmentUrls = new Set();

    try {
      console.log(`📄 Starting list scrape: ${baseUrl}`);

      await page.goto(baseUrl, { waitUntil: 'networkidle' });

      let currentPage = 1;

      while (true) {
        console.log(`\n📑 Scraping page ${currentPage}...`);

        // Extract apartment URLs from current page
        const urls = await this.extractUrlsFromPage(page);
        urls.forEach((url) => apartmentUrls.add(url));

        console.log(`   Found ${urls.length} listings (${apartmentUrls.size} total)`);

        // Check if we've reached max pages
        if (maxPages && currentPage >= maxPages) {
          console.log(`⏹️  Reached max pages (${maxPages})`);
          break;
        }

        // Try to navigate to next page
        const hasNext = await this.goToNextPage(page);
        if (!hasNext) {
          console.log('✅ No more pages');
          break;
        }

        currentPage++;

        // Rate limiting: delay between pages
        await this.delay(config.scraper.requestDelay);
      }

      return Array.from(apartmentUrls);
    } catch (error) {
      console.error(`❌ List scraping error: ${error.message}`);
      throw error;
    } finally {
      await page.close();
    }
  }

  async extractUrlsFromPage(page) {
    // Wait for listings container to load
    await page.waitForSelector('[data-testid="property-card"]', {
      timeout: 10000,
    });

    // Extract href from each property card
    return page.evaluate(() => {
      const cards = Array.from(document.querySelectorAll('[data-testid="property-card"]'));
      return cards
        .map((card) => {
          const link = card.querySelector('a[href*="/apartments/"]');
          return link ? link.href : null;
        })
        .filter(Boolean);
    });
  }

  async goToNextPage(page) {
    try {
      // Look for "Next" button
      const nextButton = await page.$('button[title="Next Page"]');

      if (!nextButton) {
        return false;
      }

      // Check if button is disabled
      const isDisabled = await nextButton.getAttribute('disabled');
      if (isDisabled !== null) {
        return false;
      }

      // **Force click** (important for JavaScript-driven pagination!)
      await nextButton.click({ force: true });

      // Wait for navigation
      await page.waitForLoadState('networkidle');

      return true;
    } catch (error) {
      console.log(`   ⚠️  Could not navigate to next page: ${error.message}`);
      return false;
    }
  }
}

module.exports = ListScraper;
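
Used on its own, the list scraper looks roughly like this (the search URL is a placeholder; the full wiring lives in cli.js, sketched after the detail scraper):

const ListScraper = require('./src/scrapers/list-scraper');

(async () => {
  const scraper = new ListScraper();
  try {
    // Placeholder search URL: crawl at most 3 pages of results
    const urls = await scraper.scrapeListings('https://www.apartments.com/austin-tx/', 3);
    console.log(`Collected ${urls.length} listing URLs`);
  } finally {
    await scraper.close();
  }
})();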

Detail Scraper: Extracting Apartment Data

Now the detail scraper visits each apartment page and extracts structured data:

// src/scrapers/detail-scraper.js
const BaseScraper = require('./base-scraper');
const config = require('../config');

class DetailScraper extends BaseScraper {
  async scrapeApartments(urls) {
    console.log(`\n🏢 Starting detail scrape for ${urls.length} apartments`);

    const apartments = [];
    const errors = [];

    // **Parallel execution** for speed
    const batchSize = config.scraper.maxConcurrent;

    for (let i = 0; i < urls.length; i += batchSize) {
      const batch = urls.slice(i, i + batchSize);
      console.log(`\n📦 Processing batch ${Math.floor(i / batchSize) + 1}...`);

      const results = await Promise.allSettled(
        batch.map((url) => this.scrapeApartmentDetail(url))
      );

      results.forEach((result, index) => {
        if (result.status === 'fulfilled' && result.value) {
          apartments.push(result.value);
        } else {
          errors.push({
            url: batch[index],
            error: result.reason?.message || 'Unknown error',
          });
        }
      });

      // Rate limiting between batches
      if (i + batchSize < urls.length) {
        await this.delay(config.scraper.requestDelay);
      }
    }

    console.log(`\n✅ Scraped ${apartments.length} apartments`);
    if (errors.length > 0) {
      console.log(`⚠️  ${errors.length} errors`);
    }

    return { apartments, errors };
  }

  async scrapeApartmentDetail(url) {
    const page = await this.createPage();

    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 20000 });

      // Extract data
      const data = await page.evaluate(() => {
        // Helper functions
        const getText = (selector) => {
          const el = document.querySelector(selector);
          return el ? el.textContent.trim() : null;
        };

        const getAll = (selector) => {
          return Array.from(document.querySelectorAll(selector)).map((el) =>
            el.textContent.trim()
          );
        };

        // Extract apartment details
        return {
          name: getText('h1[data-testid="property-name"]'),
          price: getText('[data-testid="price"]'),
          beds: getText('[data-testid="bed-count"]'),
          baths: getText('[data-testid="bath-count"]'),
          sqft: getText('[data-testid="square-feet"]'),
          address: getText('[data-testid="property-address"]'),
          phone: getText('[data-testid="phone-number"]'),
          amenities: getAll('[data-testid="amenity-item"]'),
          petPolicy: getText('[data-testid="pet-policy"]'),
          neighborhood: getText('[data-testid="neighborhood"]'),
        };
      });

      // Clean and parse data
      return this.parseApartmentData(data, url);
    } catch (error) {
      console.error(`   ❌ Error scraping ${url}: ${error.message}`);
      throw error;
    } finally {
      await page.close();
    }
  }

  parseApartmentData(raw, url) {
    // Extract numeric price
    const priceMatch = raw.price?.match(/\$([\d,]+)/);
    const price = priceMatch ? parseInt(priceMatch[1].replace(/,/g, '')) : null;

    // Extract numeric bedroom count
    const bedsMatch = raw.beds?.match(/(\d+)/);
    const beds = bedsMatch ? parseInt(bedsMatch[1]) : null;

    // Extract numeric bathroom count
    const bathsMatch = raw.baths?.match(/([\d.]+)/);
    const baths = bathsMatch ? parseFloat(bathsMatch[1]) : null;

    // Extract numeric square footage
    const sqftMatch = raw.sqft?.match(/([\d,]+)/);
    const sqft = sqftMatch ? parseInt(sqftMatch[1].replace(/,/g, '')) : null;

    return {
      url,
      name: raw.name,
      price,
      beds,
      baths,
      sqft,
      pricePerSqft: price && sqft ? Math.round(price / sqft) : null,
      address: raw.address,
      neighborhood: raw.neighborhood,
      phone: raw.phone,
      amenities: raw.amenities?.join(', ') || '',
      petPolicy: raw.petPolicy,
      scrapedAt: new Date().toISOString(),
    };
  }
}

module.exports = DetailScraper;
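
The project structure promises a cli.js entry point, filtering, and CSV export, none of which appear above. Here is one possible wiring that ties the pieces together; the flag names, the inline price/beds filter (standing in for src/filters/apartment-filter.js), and the CSV columns are my assumptions rather than the original CLI:

// cli.js (sketch)
const yargs = require('yargs');
const { hideBin } = require('yargs/helpers');
const { createObjectCsvWriter } = require('csv-writer');
const ListScraper = require('./src/scrapers/list-scraper');
const DetailScraper = require('./src/scrapers/detail-scraper');

const argv = yargs(hideBin(process.argv))
  .option('url', { type: 'string', demandOption: true, describe: 'Search results URL to crawl' })
  .option('max-pages', { type: 'number', describe: 'Limit the number of listing pages' })
  .option('max-price', { type: 'number', describe: 'Drop listings above this monthly rent' })
  .option('min-beds', { type: 'number', describe: 'Drop listings with fewer bedrooms' })
  .option('out', { type: 'string', default: 'data/apartments.csv', describe: 'CSV output path' })
  .argv;

(async () => {
  const listScraper = new ListScraper();
  const detailScraper = new DetailScraper();

  try {
    // Step 1: crawl paginated search results for listing URLs
    const urls = await listScraper.scrapeListings(argv.url, argv.maxPages || null);

    // Step 2: visit each listing in parallel batches
    const { apartments } = await detailScraper.scrapeApartments(urls);

    // Step 3: filter in memory (stand-in for apartment-filter.js)
    const filtered = apartments.filter(
      (apt) =>
        (argv.maxPrice == null || (apt.price !== null && apt.price <= argv.maxPrice)) &&
        (argv.minBeds == null || (apt.beds !== null && apt.beds >= argv.minBeds))
    );

    // Step 4: export to CSV with csv-writer
    const csvWriter = createObjectCsvWriter({
      path: argv.out,
      header: [
        { id: 'name', title: 'Name' },
        { id: 'price', title: 'Price' },
        { id: 'beds', title: 'Beds' },
        { id: 'baths', title: 'Baths' },
        { id: 'sqft', title: 'SqFt' },
        { id: 'pricePerSqft', title: 'Price/SqFt' },
        { id: 'address', title: 'Address' },
        { id: 'url', title: 'URL' },
      ],
    });
    await csvWriter.writeRecords(filtered);

    console.log(`💾 Wrote ${filtered.length} rows to ${argv.out}`);
  } finally {
    await listScraper.close();
    await detailScraper.close();
  }
})();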

Challenges & Solutions

1. Request Timeout Tuning

Problem: Some pages load slowly, causing timeouts

Solution: Increase timeout and use domcontentloaded instead of networkidle:

await page.goto(url, {
  waitUntil: 'domcontentloaded', // Faster than networkidle
  timeout: 30000, // 30 seconds
});

2. Anti-Scraping Measures

Problem: Site blocks automated browsers

Solutions:

  • Use realistic user agents
  • Add delays between requests
  • Rotate user agents (for advanced cases)
  • Use residential proxies (if necessary)

For example, rotating between a few realistic user agents:

const userAgents = [
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
];

const randomUA = userAgents[Math.floor(Math.random() * userAgents.length)];
await browser.newContext({ userAgent: randomUA });

3. Dataset Consistency

Problem: DOM structure changes between pages

Solution: Gracefully handle missing elements:

const getText = (selector) => {
  try {
    const el = document.querySelector(selector);
    return el ? el.textContent.trim() : null;
  } catch {
    return null;
  }
};

Ethical Considerations

Respect robots.txt

Check the site's robots.txt:

https://www.apartments.com/robots.txt

Respect disallowed paths and crawl delays.
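
A quick programmatic check is also possible. The sketch below is deliberately naive: it only looks at Disallow lines for all user agents and assumes Node 18+ for the built-in fetch; a real implementation should use a proper robots.txt parser and honor User-agent groups and Crawl-delay:

// Naive robots.txt check: does any Disallow rule prefix-match the path?
async function isPathAllowed(origin, path) {
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return true; // no robots.txt found: assume allowed

  const disallowed = (await res.text())
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith('disallow:'))
    .map((line) => line.slice('disallow:'.length).trim())
    .filter(Boolean);

  return !disallowed.some((rule) => path.startsWith(rule));
}

// isPathAllowed('https://www.apartments.com', '/some-search-path/').then(console.log);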

Rate Limiting

Don't hammer servers:

// Wait between requests
await this.delay(1000); // 1 second minimum

// Limit concurrent requests
const batchSize = 5; // No more than 5 at once

Terms of Service

Read and comply with the site's ToS. Many sites prohibit automated access. Consider:

  • Using public APIs if available
  • Requesting permission for large-scale scraping
  • Only scraping publicly available data

Conclusion

We've built a production-ready web scraper using Playwright and Node.js that demonstrates:

  • Scalable architecture with parallel execution
  • Robust error handling and retries
  • Flexible filtering for data refinement
  • CSV export for analysis
  • Ethical scraping practices

This Node.js tutorial showcases Playwright's power for automating JavaScript-heavy websites. The patterns here apply to many scraping scenarios beyond apartments: e-commerce price monitoring, job board aggregation, social media data collection, and more.

Key takeaways:

  • Use Playwright for sites that rely on JavaScript
  • Implement parallel execution carefully with rate limiting
  • Always respect robots.txt and ToS
  • Build flexible, maintainable code with clear separation of concerns

Whether you're learning Node.js automation or building production data pipelines, these techniques will serve you well.