Web scraping is one of the most practical applications of Node.js automation. Whether you're aggregating real estate data, monitoring competitor pricing, or building datasets for analysis, a robust scraper can save hundreds of hours of manual work.
In this Node.js tutorial, we'll build a production-ready apartment listing scraper that demonstrates:
- Playwright automation for JavaScript-heavy sites
- CLI architecture with multiple scraper modes
- Parallel execution for faster data collection
- Anti-scraping countermeasures and timeout tuning
- Data filtering and CSV export for analysis
- Ethical considerations and rate limiting
This isn't a toy example—it's a real scraper I built to aggregate apartment data from Apartments.com for market analysis. The techniques apply to any complex, JavaScript-driven website.
Why Playwright Over Puppeteer or Cheerio?
When building web scrapers in Node.js, you have several options:
| Tool | Use Case | Pros | Cons |
|------|----------|------|------|
| Cheerio | Static HTML parsing | Fast, lightweight | No JavaScript execution |
| Puppeteer | Chrome automation | Mature, good docs | Chrome-only |
| Playwright | Cross-browser automation | Multi-browser, better API | Slightly heavier |
For modern sites like Apartments.com that heavily rely on JavaScript for pagination and dynamic content, Playwright is the clear winner:
- ✅ Handles JavaScript-driven pagination
- ✅ Cross-browser support (Chromium, Firefox, WebKit)
- ✅ Better timeout handling and retry logic
- ✅ Auto-waiting for elements (no more `setTimeout` hacks)
- ✅ Built-in network interception (see the short sketch below)
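To make the last two bullets concrete, here is a minimal, self-contained sketch; the URL and selector are placeholders rather than anything from the scraper we build below:

```javascript
// Minimal sketch of auto-waiting and network interception.
// The URL and selector below are placeholders.
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  // Network interception: skip image downloads to speed things up.
  await page.route('**/*.{png,jpg,jpeg}', (route) => route.abort());

  await page.goto('https://example.com/listings');

  // click() auto-waits until the element is attached, visible, and enabled,
  // so no manual setTimeout is needed before interacting.
  await page.click('text=Next');

  await browser.close();
})();
```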
Project Setup
Initialize Node.js Project
mkdir apartments-scraper && cd apartments-scraper
npm init -y
# Install dependencies
npm install playwright csv-writer yargs dotenv
npm install --save-dev eslint prettier
Project Structure
apartments-scraper/
├── package.json
├── .env
├── cli.js                       # Entry point
├── src/
│   ├── scrapers/
│   │   ├── list-scraper.js      # Crawls listing pages
│   │   ├── detail-scraper.js    # Extracts apartment details
│   │   └── base-scraper.js      # Shared scraper logic
│   ├── filters/
│   │   └── apartment-filter.js  # Filter logic
│   ├── utils/
│   │   ├── csv-exporter.js      # CSV export
│   │   └── logger.js            # Logging utility
│   └── config.js                # Configuration
├── data/
│   └── apartments.csv           # Output
└── README.md
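The scrapers below pull their settings from src/config.js, which isn't reproduced in this post. A minimal sketch consistent with the values referenced later (browser.headless, browser.slowMo, browser.timeout, scraper.requestDelay, scraper.maxConcurrent) could look like this; the environment variable names and defaults are my assumptions, not the original values:

```javascript
// src/config.js (sketch): env var names and defaults are assumptions
require('dotenv').config();

module.exports = {
  browser: {
    headless: process.env.HEADLESS !== 'false',        // run headless unless HEADLESS=false
    slowMo: Number(process.env.SLOW_MO || 0),          // slow down actions for debugging
    timeout: Number(process.env.TIMEOUT_MS || 30000),  // default per-page timeout (ms)
  },
  scraper: {
    requestDelay: Number(process.env.REQUEST_DELAY_MS || 1000), // pause between pages/batches (ms)
    maxConcurrent: Number(process.env.MAX_CONCURRENT || 5),     // detail pages scraped in parallel
  },
};
```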
Base Scraper: Shared Playwright Logic
First, let's create a base class with common Playwright setup:
// src/scrapers/base-scraper.js
const { chromium } = require('playwright');
const config = require('../config');

class BaseScraper {
  constructor() {
    this.browser = null;
    this.context = null;
  }

  async initialize() {
    console.log('🚀 Launching browser...');
    this.browser = await chromium.launch({
      headless: config.browser.headless,
      slowMo: config.browser.slowMo,
    });

    // Create context with realistic user agent
    this.context = await this.browser.newContext({
      userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
      viewport: { width: 1920, height: 1080 },
    });

    console.log('✅ Browser ready');
  }

  async createPage() {
    if (!this.context) {
      await this.initialize();
    }

    const page = await this.context.newPage();

    // Set default timeout
    page.setDefaultTimeout(config.browser.timeout);

    // Handle console messages for debugging
    page.on('console', (msg) => {
      if (msg.type() === 'error') {
        console.error(`[Browser Error] ${msg.text()}`);
      }
    });

    return page;
  }

  async close() {
    if (this.browser) {
      await this.browser.close();
      console.log('🔒 Browser closed');
    }
  }

  async delay(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

module.exports = BaseScraper;
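Both scrapers below extend this class. As a quick illustration of the lifecycle (a hypothetical standalone script, not one of the project files):

```javascript
// Hypothetical lifecycle demo: create a page, use it, always close the browser.
const BaseScraper = require('./src/scrapers/base-scraper');

(async () => {
  const scraper = new BaseScraper();
  try {
    const page = await scraper.createPage(); // lazily calls initialize()
    await page.goto('https://example.com');
    console.log(await page.title());
  } finally {
    await scraper.close(); // release the browser even if scraping throws
  }
})();
```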
List Scraper: Crawling Pagination
The list scraper navigates through paginated results to collect all apartment URLs:
// src/scrapers/list-scraper.js
const BaseScraper = require('./base-scraper');
const config = require('../config');

class ListScraper extends BaseScraper {
  async scrapeListings(baseUrl, maxPages = null) {
    const page = await this.createPage();
    const apartmentUrls = new Set();

    try {
      console.log(`📄 Starting list scrape: ${baseUrl}`);
      await page.goto(baseUrl, { waitUntil: 'networkidle' });

      let currentPage = 1;
      while (true) {
        console.log(`\n📑 Scraping page ${currentPage}...`);

        // Extract apartment URLs from current page
        const urls = await this.extractUrlsFromPage(page);
        urls.forEach((url) => apartmentUrls.add(url));
        console.log(` Found ${urls.length} listings (${apartmentUrls.size} total)`);

        // Check if we've reached max pages
        if (maxPages && currentPage >= maxPages) {
          console.log(`⏹️ Reached max pages (${maxPages})`);
          break;
        }

        // Try to navigate to next page
        const hasNext = await this.goToNextPage(page);
        if (!hasNext) {
          console.log('✅ No more pages');
          break;
        }

        currentPage++;

        // Rate limiting: delay between pages
        await this.delay(config.scraper.requestDelay);
      }

      return Array.from(apartmentUrls);
    } catch (error) {
      console.error(`❌ List scraping error: ${error.message}`);
      throw error;
    } finally {
      await page.close();
    }
  }

  async extractUrlsFromPage(page) {
    // Wait for listings container to load
    await page.waitForSelector('[data-testid="property-card"]', {
      timeout: 10000,
    });

    // Extract href from each property card
    return page.evaluate(() => {
      const cards = Array.from(document.querySelectorAll('[data-testid="property-card"]'));
      return cards
        .map((card) => {
          const link = card.querySelector('a[href*="/apartments/"]');
          return link ? link.href : null;
        })
        .filter(Boolean);
    });
  }

  async goToNextPage(page) {
    try {
      // Look for "Next" button
      const nextButton = await page.$('button[title="Next Page"]');
      if (!nextButton) {
        return false;
      }

      // Check if button is disabled
      const isDisabled = await nextButton.getAttribute('disabled');
      if (isDisabled !== null) {
        return false;
      }

      // **Force click** (important for JavaScript-driven pagination!)
      await nextButton.click({ force: true });

      // Wait for navigation
      await page.waitForLoadState('networkidle');
      return true;
    } catch (error) {
      console.log(` ⚠️ Could not navigate to next page: ${error.message}`);
      return false;
    }
  }
}

module.exports = ListScraper;
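Before wiring it into the detail scraper, it helps to smoke-test the list scraper on its own. A hypothetical snippet (the search URL is a placeholder):

```javascript
// Hypothetical smoke test for ListScraper; the URL is a placeholder.
const ListScraper = require('./src/scrapers/list-scraper');

(async () => {
  const scraper = new ListScraper();
  try {
    // Cap at 2 pages while verifying selectors and pagination.
    const urls = await scraper.scrapeListings('https://www.apartments.com/your-city/', 2);
    console.log(`Collected ${urls.length} listing URLs`);
  } finally {
    await scraper.close();
  }
})();
```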
Detail Scraper: Extracting Apartment Data
Now the detail scraper visits each apartment page and extracts structured data:
// src/scrapers/detail-scraper.js
const BaseScraper = require('./base-scraper');
const config = require('../config');

class DetailScraper extends BaseScraper {
  async scrapeApartments(urls) {
    console.log(`\n🏢 Starting detail scrape for ${urls.length} apartments`);

    const apartments = [];
    const errors = [];

    // **Parallel execution** for speed
    const batchSize = config.scraper.maxConcurrent;

    for (let i = 0; i < urls.length; i += batchSize) {
      const batch = urls.slice(i, i + batchSize);
      console.log(`\n📦 Processing batch ${Math.floor(i / batchSize) + 1}...`);

      const results = await Promise.allSettled(
        batch.map((url) => this.scrapeApartmentDetail(url))
      );

      results.forEach((result, index) => {
        if (result.status === 'fulfilled' && result.value) {
          apartments.push(result.value);
        } else {
          errors.push({
            url: batch[index],
            error: result.reason?.message || 'Unknown error',
          });
        }
      });

      // Rate limiting between batches
      if (i + batchSize < urls.length) {
        await this.delay(config.scraper.requestDelay);
      }
    }

    console.log(`\n✅ Scraped ${apartments.length} apartments`);
    if (errors.length > 0) {
      console.log(`⚠️ ${errors.length} errors`);
    }

    return { apartments, errors };
  }

  async scrapeApartmentDetail(url) {
    const page = await this.createPage();

    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 20000 });

      // Extract data
      const data = await page.evaluate(() => {
        // Helper functions
        const getText = (selector) => {
          const el = document.querySelector(selector);
          return el ? el.textContent.trim() : null;
        };

        const getAll = (selector) => {
          return Array.from(document.querySelectorAll(selector)).map((el) =>
            el.textContent.trim()
          );
        };

        // Extract apartment details
        return {
          name: getText('h1[data-testid="property-name"]'),
          price: getText('[data-testid="price"]'),
          beds: getText('[data-testid="bed-count"]'),
          baths: getText('[data-testid="bath-count"]'),
          sqft: getText('[data-testid="square-feet"]'),
          address: getText('[data-testid="property-address"]'),
          phone: getText('[data-testid="phone-number"]'),
          amenities: getAll('[data-testid="amenity-item"]'),
          petPolicy: getText('[data-testid="pet-policy"]'),
          neighborhood: getText('[data-testid="neighborhood"]'),
        };
      });

      // Clean and parse data
      return this.parseApartmentData(data, url);
    } catch (error) {
      console.error(` ❌ Error scraping ${url}: ${error.message}`);
      throw error;
    } finally {
      await page.close();
    }
  }

  parseApartmentData(raw, url) {
    // Extract numeric price
    const priceMatch = raw.price?.match(/\$([\d,]+)/);
    const price = priceMatch ? parseInt(priceMatch[1].replace(/,/g, '')) : null;

    // Extract numeric bedroom count
    const bedsMatch = raw.beds?.match(/(\d+)/);
    const beds = bedsMatch ? parseInt(bedsMatch[1]) : null;

    // Extract numeric bathroom count
    const bathsMatch = raw.baths?.match(/([\d.]+)/);
    const baths = bathsMatch ? parseFloat(bathsMatch[1]) : null;

    // Extract numeric square footage
    const sqftMatch = raw.sqft?.match(/([\d,]+)/);
    const sqft = sqftMatch ? parseInt(sqftMatch[1].replace(/,/g, '')) : null;

    return {
      url,
      name: raw.name,
      price,
      beds,
      baths,
      sqft,
      pricePerSqft: price && sqft ? Math.round(price / sqft) : null,
      address: raw.address,
      neighborhood: raw.neighborhood,
      phone: raw.phone,
      amenities: raw.amenities?.join(', ') || '',
      petPolicy: raw.petPolicy,
      scrapedAt: new Date().toISOString(),
    };
  }
}

module.exports = DetailScraper;
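The CLI entry point (cli.js, built on yargs) and src/utils/csv-exporter.js aren't reproduced in this post, but the wiring is straightforward: collect URLs with the list scraper, fan out to the detail scraper, filter, and write a CSV with the csv-writer package installed earlier. A condensed sketch under those assumptions; the filter condition, CSV columns, and paths are illustrative:

```javascript
// Condensed orchestration sketch. The filter, CSV columns, and paths are
// illustrative; the real cli.js and csv-exporter.js are not shown in this post.
const { createObjectCsvWriter } = require('csv-writer');
const ListScraper = require('./src/scrapers/list-scraper');
const DetailScraper = require('./src/scrapers/detail-scraper');

async function run(searchUrl) {
  const listScraper = new ListScraper();
  const detailScraper = new DetailScraper();
  try {
    const urls = await listScraper.scrapeListings(searchUrl, 5);
    const { apartments } = await detailScraper.scrapeApartments(urls);

    // Example filter: keep listings with a known price under $2,500/month.
    const filtered = apartments.filter((apt) => apt.price && apt.price < 2500);

    const csvWriter = createObjectCsvWriter({
      path: 'data/apartments.csv',
      header: [
        { id: 'name', title: 'Name' },
        { id: 'price', title: 'Price' },
        { id: 'beds', title: 'Beds' },
        { id: 'baths', title: 'Baths' },
        { id: 'sqft', title: 'SqFt' },
        { id: 'pricePerSqft', title: 'Price/SqFt' },
        { id: 'address', title: 'Address' },
        { id: 'url', title: 'URL' },
      ],
    });
    await csvWriter.writeRecords(filtered);
    console.log(`💾 Wrote ${filtered.length} rows to data/apartments.csv`);
  } finally {
    await listScraper.close();
    await detailScraper.close();
  }
}

run(process.argv[2]).catch((error) => {
  console.error(error);
  process.exit(1);
});
```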
Challenges & Solutions
1. Request Timeout Tuning
Problem: Some pages load slowly, causing timeouts
Solution: Increase timeout and use domcontentloaded instead of networkidle:
await page.goto(url, {
  waitUntil: 'domcontentloaded', // Faster than networkidle
  timeout: 30000, // 30 seconds
});
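If a page still times out now and then, a small retry helper with exponential backoff stops one slow listing from sinking a whole batch. This isn't in the original scraper; it's a sketch of how scrapeApartmentDetail could be wrapped:

```javascript
// Hypothetical retry helper with exponential backoff (2s, 4s, 8s, ...).
async function withRetries(fn, { attempts = 3, baseDelayMs = 2000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
}

// Possible usage inside DetailScraper.scrapeApartments:
// batch.map((url) => withRetries(() => this.scrapeApartmentDetail(url)))
```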
2. Anti-Scraping Measures
Problem: Site blocks automated browsers
Solutions:
- Use realistic user agents
- Add delays between requests
- Rotate user agents (for advanced cases)
- Use residential proxies (if necessary)
const userAgents = [
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
];

const randomUA = userAgents[Math.floor(Math.random() * userAgents.length)];
await browser.newContext({ userAgent: randomUA });
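For the proxy option in the list above, Playwright accepts proxy settings at launch time; the server URL and credentials here are placeholders:

```javascript
// Proxy sketch (inside an async function); server and credentials are placeholders.
const { chromium } = require('playwright');

const browser = await chromium.launch({
  proxy: {
    server: 'http://proxy.example.com:8000',
    username: process.env.PROXY_USER,
    password: process.env.PROXY_PASS,
  },
});
```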
3. Dataset Consistency
Problem: DOM structure changes between pages
Solution: Gracefully handle missing elements:
const getText = (selector) => {
  try {
    const el = document.querySelector(selector);
    return el ? el.textContent.trim() : null;
  } catch {
    return null;
  }
};
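If the markup varies more than that, a further step is to try a short list of fallback selectors inside the same page.evaluate call. The selectors in the usage comment are illustrative, not taken from the site:

```javascript
// Hypothetical extension of getText: try several selectors until one matches.
// Runs in the browser context (inside page.evaluate), like getText above.
const getTextFromAny = (selectors) => {
  for (const selector of selectors) {
    const el = document.querySelector(selector);
    if (el && el.textContent.trim()) {
      return el.textContent.trim();
    }
  }
  return null;
};

// e.g. getTextFromAny(['[data-testid="price"]', '[class*="price"]', '.pricing']);
```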
Ethical Considerations
Respect robots.txt
Check the site's robots.txt:
https://www.apartments.com/robots.txt
Respect disallowed paths and crawl delays.
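One way to enforce this at runtime is to fetch robots.txt and skip disallowed paths before queueing URLs. A deliberately simplistic sketch (uses Node 18+ global fetch and ignores wildcards, Allow rules, and Crawl-delay, so treat it as a starting point rather than a full parser):

```javascript
// Minimal robots.txt check: only honours plain "Disallow:" prefixes
// listed under "User-agent: *". Not a complete parser.
async function isPathAllowed(origin, path) {
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return true; // no robots.txt to respect

  let appliesToAll = false;
  const disallowed = [];
  for (const raw of (await res.text()).split('\n')) {
    const line = raw.split('#')[0].trim();
    if (/^user-agent:/i.test(line)) {
      appliesToAll = line.slice('user-agent:'.length).trim() === '*';
    } else if (appliesToAll && /^disallow:/i.test(line)) {
      const rule = line.slice('disallow:'.length).trim();
      if (rule) disallowed.push(rule);
    }
  }
  return !disallowed.some((rule) => path.startsWith(rule));
}

// Usage sketch: if (!(await isPathAllowed('https://www.apartments.com', '/some-path/'))) skip it.
```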
Rate Limiting
Don't hammer servers:
// Wait between requests
await this.delay(1000); // 1 second minimum
// Limit concurrent requests
const batchSize = 5; // No more than 5 at once
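Adding a little random jitter on top of the fixed delay also makes the traffic pattern less mechanical; a small variant of the delay helper (not part of the original code):

```javascript
// Hypothetical jittered delay: wait baseMs plus a random 0..jitterMs.
function politeDelay(baseMs = 1000, jitterMs = 500) {
  const ms = baseMs + Math.floor(Math.random() * jitterMs);
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// e.g. await politeDelay(config.scraper.requestDelay);
```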
Terms of Service
Read and comply with the site's ToS. Many sites prohibit automated access. Consider:
- Using public APIs if available
- Requesting permission for large-scale scraping
- Only scraping publicly available data
Conclusion
We've built a production-ready web scraper using Playwright and Node.js that demonstrates:
- Scalable architecture with parallel execution
- Robust error handling and retries
- Flexible filtering for data refinement
- CSV export for analysis
- Ethical scraping practices
This Node.js tutorial showcases Playwright's power for automating JavaScript-heavy websites. The patterns here apply to many scraping scenarios beyond apartments: e-commerce price monitoring, job board aggregation, social media data collection, and more.
Key takeaways:
- Use Playwright for sites that rely on JavaScript
- Implement parallel execution carefully with rate limiting
- Always respect robots.txt and ToS
- Build flexible, maintainable code with clear separation of concerns
Whether you're learning Node.js automation or building production data pipelines, these techniques will serve you well.