Detection Pipeline
1. Fetch Page: HTML + HTTP headers
2. SPA Detection: JS-rendered shell check
3. Technographics: 35 built-in + 4,300 webappanalyzer
4. Traffic & Domain: CDN, analytics, domain quality
5. Employee Footprint: ATS, hiring, team size
6. Scrape About Page: /about, /our-story, etc.
7. Semantic / NLP: 60+ keywords, links, structure
8. Score & Classify: Net score → bucket
Scoring System
Every detected signal earns points in a direction (SMB or Enterprise) at a confidence level. Points are summed per category, then each category is capped at 20 pts to prevent any single signal type from dominating.
| Confidence | Points | When to use |
| high | 10 pts | Definitive platform fingerprint or unambiguous keyword (e.g., Wix detected via hostname, AEM detected via /content/dam/) |
| medium | 5 pts | Strong indicator but not conclusive alone (e.g., premium short .com domain, OneTrust cookie banner) |
| low | 2 pts | Weak or corroborating signal (e.g., local-only phone number, toll-free phone number) |
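The summing and capping rule can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Signal` record and `capped_totals` function are hypothetical names, and the sketch assumes the 20-point cap applies to each direction within a category separately, which the text does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass

# Point values per confidence level, taken from the table above.
CONFIDENCE_POINTS = {"high": 10, "medium": 5, "low": 2}
CATEGORY_CAP = 20  # each category is capped at 20 pts


@dataclass(frozen=True)
class Signal:
    """Hypothetical record for one detected signal."""
    category: str    # e.g. "technographics"
    direction: str   # "smb" or "enterprise"
    confidence: str  # "high", "medium", or "low"


def capped_totals(signals):
    """Return (smb_points, enterprise_points) after per-category capping.

    Assumption: the cap is applied per (category, direction) pair.
    """
    totals = defaultdict(int)
    for s in signals:
        totals[(s.category, s.direction)] += CONFIDENCE_POINTS[s.confidence]
    smb = sum(min(p, CATEGORY_CAP) for (_, d), p in totals.items() if d == "smb")
    enterprise = sum(
        min(p, CATEGORY_CAP) for (_, d), p in totals.items() if d == "enterprise"
    )
    return smb, enterprise
```

Under this assumption, three high-confidence SMB technographic signals contribute 20 points rather than 30.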
The net score (SMB pts − Enterprise pts) determines the final bucket:
| Net Score | Bucket | Confidence level |
| ≥ +30 | Likely SMB | high |
| +15 to +29 | Likely SMB | medium |
| +10 to +14 | Likely SMB | low |
| −9 to +9 | Indeterminate | — |
| −10 to −14 | Likely Enterprise | low |
| −15 to −29 | Likely Enterprise | medium |
| ≤ −30 | Likely Enterprise | high |
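The bucket table is symmetric around zero, so it can be expressed compactly. The function below is an illustrative sketch; the name `classify` is hypothetical and the thresholds are copied from the table above.

```python
from typing import Optional, Tuple


def classify(net_score: int) -> Tuple[str, Optional[str]]:
    """Map a net score (SMB pts minus Enterprise pts) to (bucket, confidence)."""
    magnitude = abs(net_score)
    if magnitude < 10:
        # Scores from -9 to +9 carry no confidence level.
        return ("Indeterminate", None)
    bucket = "Likely SMB" if net_score > 0 else "Likely Enterprise"
    if magnitude >= 30:
        return (bucket, "high")
    if magnitude >= 15:
        return (bucket, "medium")
    return (bucket, "low")
```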
1. Technographics
The tech stack is the single strongest indicator. Detection uses two layers: 35 hand-tuned built-in fingerprints matched against HTTP headers, HTML content, script sources, meta tags, cookies, and hostnames; plus 4,337 signatures from the webappanalyzer database (the open-source successor to Wappalyzer). SPA shell detection also runs here.
SMB platforms (15 signals)
| Platform | Points | Detection method |
| Wix | +10 | Hostname *.wixsite.com, Pepyaka server header, wix-thunderbolt in HTML, meta generator |
| Squarespace | +10 | Squarespace server header, SQUARESPACE_ROLLUPS, sqs-block CSS class, CDN hostname |
| Weebly | +10 | Hostname *.weebly.com, wsite- CSS classes, editmysite.com CDN |
| GoDaddy Website Builder | +10 | Meta generator, img.secureserver.net CDN |
| Blogger / Blogspot | +10 | Hostname *.blogspot.com |
| Carrd | +10 | Hostname *.carrd.co, carrd.co in HTML |
| Duda | +10 | dm-root- CSS, dmcdn.net CDN |
| Square Online | +10 | Hostname *.square.site, squarecdn.com CDN |
| Jimdo | +10 | Hostname *.jimdosite.com, meta generator |
| Webflow | +5 | Hostname *.webflow.io, data-wf-site, website-files.com CDN |
| WordPress.com (hosted) | +5 | Hostname *.wordpress.com, s0.wp.com CDN, meta generator |
| Shopify (Basic/Standard) | +5 | powered-by: Shopify header, x-shopid header, cdn.shopify.com |
| Yelp Badge / Widget | +5 | yelp.com/biz/ link in HTML |
| Mailchimp (Marketing) | +2 | chimpstatic.com, list-manage.com in scripts |
| Constant Contact | +2 | constantcontact.com in HTML |
Enterprise platforms (20 signals)
| Platform | Points | Detection method |
| Adobe Experience Manager (AEM) | −10 | /content/dam/ paths, cq-component, adobeaemcloud.com, x-dispatcher header |
| Sitecore | −10 | sc_site= in HTML, SC_ANALYTICS cookie, layouts/system |
| Salesforce Commerce Cloud | −10 | demandware.static CDN, dwvar_ params, dwfrm_ forms |
| SAP Commerce (Hybris) | −10 | hybris, acceleratorstorefront, ycommerce in HTML |
| Oracle Commerce | −10 | /atg/rest/ API paths, endeca.com |
| Episerver / Optimizely CMS | −10 | episerver, epi-contentarea in HTML, meta generator |
| Marketo (Marketing Automation) | −10 | marketo.net scripts, munchkin tracker, mkto in HTML |
| Eloqua (Oracle) | −10 | eloqua.com scripts, elqCFG, elqSiteId globals |
| Pardot (Salesforce) | −10 | pardot.com scripts, piAId, piCId globals |
| Tealium (Tag Management) | −10 | tags.tiqcdn.com script, utag.js, tealium in HTML |
| Adobe Analytics / Launch | −10 | assets.adobedtm.com, omtrdc.net, AppMeasurement, s_code |
| Shopify Plus | −10 | shopify-plus in HTML, checkout.shopifycs.com, x-shopify-stage header |
| Akamai CDN | −10 | AkamaiGHost server header, x-akamai-transformed header, akamaized.net CDN |
| Segment | −5 | cdn.segment.com script, analytics.js |
| OneTrust / Cookie Consent | −5 | cdn.cookielaw.org, onetrust.com scripts, optanon in HTML |
| HubSpot Enterprise | −5 | js.hs-scripts.com, hsforms.net, hs-script-loader |
| Amazon CloudFront | −5 | CloudFront server header, x-amz-cf-id header |
| Fastly CDN | −5 | varnish via header, x-served-by: cache-*, x-fastly-request-id |
| Magento / Adobe Commerce | −5 | requirejs-config.js, varien/js, meta generator |
| Drupal (Enterprise CMS) | −5 | Drupal. JS global, x-drupal-cache header, sites/default/files |
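One way to represent and match fingerprints like those in the two tables above is sketched here. The `Fingerprint` shape and `match_fingerprints` function are hypothetical, only four rows are included, and the sketch assumes a single hit on any hostname, header, or HTML marker is enough to fire a fingerprint; the source does not say how many hits are required.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical shape for one built-in fingerprint."""
    name: str
    points: int                    # positive = SMB, negative = Enterprise
    hostname_suffixes: tuple = ()  # e.g. (".wixsite.com",)
    headers: tuple = ()            # (header name, substring) pairs
    html_markers: tuple = ()       # substrings searched in the HTML


# A small subset of the rows in the tables above.
FINGERPRINTS = [
    Fingerprint("Wix", +10, (".wixsite.com",), (("server", "pepyaka"),), ("wix-thunderbolt",)),
    Fingerprint("Shopify (Basic/Standard)", +5, (), (("powered-by", "shopify"),), ("cdn.shopify.com",)),
    Fingerprint("Adobe Experience Manager (AEM)", -10, (), (), ("/content/dam/", "cq-component")),
    Fingerprint("Akamai CDN", -10, (), (("server", "akamaighost"),), ("akamaized.net",)),
]


def match_fingerprints(hostname, headers, html):
    """Return (name, points) for every fingerprint with at least one hit."""
    host = hostname.lower()
    hdrs = {k.lower(): v.lower() for k, v in headers.items()}
    body = html.lower()
    hits = []
    for fp in FINGERPRINTS:
        matched = (
            any(host.endswith(suffix) for suffix in fp.hostname_suffixes)
            or any(needle in hdrs.get(name, "") for name, needle in fp.headers)
            or any(marker in body for marker in fp.html_markers)
        )
        if matched:
            hits.append((fp.name, fp.points))
    return hits
```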
SPA / JS-rendered shell (1 signal)
| Signal | Points | Detection logic |
| SPA / JS-rendered application | −5 | HTML <50KB + (JS bundle scripts OR id="root"/id="__next" + React/Angular/Vue markers) + fewer than 15 links. Prevents false SMB classification of enterprise SPAs (Amazon, Target, etc.) |
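The shell check above might look like the sketch below. The grouping of the conditions is an interpretation: small HTML, AND (bundle scripts OR a mount-point id together with framework markers), AND fewer than 15 links. The bundle filename regex and the framework marker strings are hypothetical heuristics, not taken from the source.

```python
import re
from html.parser import HTMLParser

MAX_SHELL_BYTES = 50 * 1024  # "HTML <50KB"
MAX_SHELL_LINKS = 15         # "fewer than 15 links"
# Hypothetical heuristics for what counts as a bundle or a framework marker.
BUNDLE_SRC = re.compile(r"(main|app|bundle|chunk|runtime|vendor)[\w.\-]*\.js", re.I)
FRAMEWORK_MARKERS = ("react", "ng-version", "data-v-", "__next_data__")


class _ShellScanner(HTMLParser):
    """Counts links and records bundle scripts and SPA mount points."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.has_bundle = False
        self.has_mount = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            self.links += 1
        elif tag == "script" and BUNDLE_SRC.search(attr.get("src") or ""):
            self.has_bundle = True
        if attr.get("id") in ("root", "__next"):
            self.has_mount = True


def is_spa_shell(html: str) -> bool:
    """Apply the three conditions from the table row above."""
    if len(html.encode("utf-8")) >= MAX_SHELL_BYTES:
        return False
    scanner = _ShellScanner()
    scanner.feed(html)
    lowered = html.lower()
    has_markers = any(marker in lowered for marker in FRAMEWORK_MARKERS)
    js_driven = scanner.has_bundle or (scanner.has_mount and has_markers)
    return js_driven and scanner.links < MAX_SHELL_LINKS
```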
Webappanalyzer layer (4,337 technology signatures)
The enthec/webappanalyzer database (open-source successor to Wappalyzer) is matched against the same fetched HTML and headers. Each detected technology is mapped to a signal using name-based and category-based lookup tables. A sample of the high-signal technologies:
| Technology | Points | Category |
| Google Tag Manager | −5 | Tag managers |
| OneTrust | −5 | Cookie compliance |
| Segment | −5 | Analytics |
| Akamai mPulse | −2 | RUM / Analytics |
| Bazaarvoice Reviews | −5 | Reviews (enterprise) |
| Amazon S3 | −5 | CDN / IaaS |
| Greenhouse / Lever / Workday | −5 | Recruitment |
| Squarespace Commerce | +5 | Ecommerce (SMB) |
| Calendly / Acuity | +2 | Appointment scheduling |
| Tawk.to / Tidio / Crisp | +5 | Live chat (SMB tier) |
Thousands more technologies are mapped across categories including CMS, Ecommerce, Analytics, Marketing automation, CDN, Tag managers, Live chat, CRM, SEO, Hosting, A/B testing, and Page builders.
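The name-based and category-based lookup can be sketched as a two-step fallback. The tables below are illustrative and contain only a few entries modeled on the sample rows; the function name `points_for` and the rule that a name match takes precedence over a category match are assumptions.

```python
# Hypothetical lookup tables; entries mirror rows of the sample table above.
NAME_POINTS = {
    "Google Tag Manager": -5,
    "OneTrust": -5,
    "Akamai mPulse": -2,
    "Calendly": +2,
    "Tawk.to": +5,
}
CATEGORY_POINTS = {
    "Recruitment": -5,
    "Appointment scheduling": +2,
}


def points_for(technology, categories):
    """Name-based lookup first, then the first matching category.

    Returns None when the technology maps to no signal.
    """
    if technology in NAME_POINTS:
        return NAME_POINTS[technology]
    for category in categories:
        if category in CATEGORY_POINTS:
            return CATEGORY_POINTS[category]
    return None
```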
2. Traffic & Domain
These signals are proxies for web scale when direct traffic data is unavailable. The absence of enterprise signals is not treated as evidence of SMB, because SPA sites load analytics via JavaScript that isn't visible in raw HTML.
Domain quality signals (5 signals)
| Signal | Points | Logic |
| Ultra-premium .com (1–4 chars) | −10 | ibm.com, hp.com, ford.com, visa.com — these domains are worth millions |
| Premium short .com (5–6 chars) | −5 | apple.com, intel.com, cisco.com, adobe.com |
| Single-word .com (≤10 chars) | −2 | amazon.com, target.com, walmart.com, starbucks.com — clean brand domains |
| Very long domain (>20 chars) | +2 | SMBs often can't secure short names, e.g. bestplumbingserviceinatlanta.com |
| Multi-hyphenated domain (2+ hyphens) | +2 | Local businesses often register joes-pizza-brooklyn.com |
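A sketch of the domain quality rules follows. It rests on several assumptions the source does not confirm: lengths are measured on the label before the TLD, the .com tiers are checked shortest-first so only one fires, and "single-word" is approximated as letters only. The function name is hypothetical.

```python
def domain_quality_signals(hostname):
    """Return (signal, points) pairs for a bare domain such as 'ibm.com'."""
    host = hostname.lower()
    if host.startswith("www."):
        host = host[4:]
    name, _, tld = host.rpartition(".")
    label = name.split(".")[-1]  # e.g. "ibm" in "ibm.com"
    signals = []
    if tld == "com" and label:
        # Assumption: only the first matching length tier fires.
        if len(label) <= 4:
            signals.append(("Ultra-premium .com", -10))
        elif len(label) <= 6:
            signals.append(("Premium short .com", -5))
        elif len(label) <= 10 and label.isalpha():
            signals.append(("Single-word .com", -2))
    if len(label) > 20:
        signals.append(("Very long domain", +2))
    if label.count("-") >= 2:
        signals.append(("Multi-hyphenated domain", +2))
    return signals
```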
Hosting & CDN signals (4 signals)
| Signal | Points | Logic |
| Free platform subdomain | +10 | *.wixsite.com, *.squarespace.com, *.myshopify.com, *.wordpress.com, etc. — no custom domain is a strong SMB signal |
| Country-code TLD | +2 | .us, .uk, .ca, .au, etc. with no subdomain — regional focus common for SMBs |
| Multi-language support | −5 | hreflang tags or language switcher — internationalization requires significant investment |
| Multiple analytics platforms (3+) | −5 | Layered analytics stacks (GTM + Adobe + Segment + etc.) suggest enterprise marketing budget |
Page complexity signals (2 signals)
| Signal | Points | Logic |
| High link density (>150 links) | −5 | Large site structures with many navigation links suggest enterprise-scale content |
| Large page size (>500KB HTML) | −2 | Very heavy HTML pages suggest enterprise-level inline content. Small pages are NOT treated as SMB signals since SPA shells are intentionally tiny. |
Chat & payment signals (~8 signals)
| Signal | Points | Logic |
| Intercom, Drift, Qualified, 6sense chat | −2 | Enterprise-tier conversational platforms, typically $1k+/mo |
| Tawk.to, Tidio, Crisp, Olark chat | +2 | Free/affordable live chat tools popular with SMBs |
| PayPal (without Braintree) | +2 | Basic PayPal integration (not Braintree/advanced) typical of SMBs |
| Stripe integration | +2 | Stripe is popular with SMBs and startups |
3. Employee Footprint
Hiring infrastructure and team language are strong proxies for company size. The LinkedIn API is not used; on-page signals are analyzed instead.
All employee footprint signals (6 signal types)
| Signal | Points | Logic |
| Enterprise ATS detected | −10 | Greenhouse, Lever, Workday, Oracle Taleo, iCIMS, SmartRecruiters, Jobvite, BrassRing, myworkdayjobs.com |
| Informal hiring language | +5 | "Now hiring", "join our team", "apply in person", "send your resume to [email]" — no ATS detected |
| Small team section (≤10 members) | +5 | "Meet the team" section with ≤10 visible member cards |
| Large team section (>50 members) | −5 | Team section with more than 50 employee cards |
| Employee count ≤50 mentioned | +10 | Explicit mention like "our team of 12" — firmly SMB |
| Employee count 51–500 mentioned | +5 | Mid-range employee count, likely SMB or mid-market |
| Employee count over 500 mentioned | −10 | Explicit large employee count |
| 5+ global cities mentioned | −5 | NY, London, Tokyo, Singapore, etc. — global office presence |
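The explicit employee count rows could be implemented along these lines. The two regular expressions are hypothetical; the source gives only "our team of 12" as an example phrase. The sketch also assumes a count of exactly 500 falls in the 51–500 tier.

```python
import re

# Hypothetical patterns; the source only gives "our team of 12" as an example.
EMPLOYEE_PATTERNS = [
    re.compile(r"\bteam of (\d[\d,]*)\b", re.I),
    re.compile(r"\b(\d[\d,]*)\+?\s+employees\b", re.I),
]


def employee_count_signal(text):
    """Return (count, points) for the first explicit headcount found, else None."""
    for pattern in EMPLOYEE_PATTERNS:
        match = pattern.search(text)
        if match:
            count = int(match.group(1).replace(",", ""))
            if count <= 50:
                return (count, +10)
            if count <= 500:
                return (count, +5)
            return (count, -10)
    return None
```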
4. Semantic / NLP
Raw HTML text (minus scripts and styles) is scanned for keyword patterns across the homepage and any discovered About page. The About page is fetched separately from paths: /about, /about-us, /our-story, /who-we-are, /company, /company/about.
SMB keywords: homepage (32 patterns)
| Keyword / Pattern | Points |
| "family-owned", "family-run", "family business" | +10 each |
| "locally-owned", "small business", "local business" | +10 each |
| "small team", "owner-operated", "home-based" | +10 each |
| "mom-and-pop", "independently-owned" | +10 each |
| "woman-owned", "veteran-owned", "minority-owned" | +10 each |
| "licensed and insured" | +10 |
| "BBB" / "Better Business Bureau" | +5 each |
| "boutique", "neighborhood", "our small", "our shop" | +5 each |
| "serving the [X] area", "serving [X] and surrounding" | +5 each |
| "free estimate", "free quote", "proudly serving", "fully insured" | +5 each |
| "since [year]", "established [year]", "call us today", "get a quote", "our founder" | +2 each |
Enterprise keywords: homepage (29 patterns)
| Keyword / Pattern | Points |
| "global headquarters", "corporate headquarters" | −10 each |
| "investor relations", "annual report", "10-K", "SEC filing" | −10 each |
| "subsidiary/subsidiaries", "global presence", "global offices", "board of directors" | −10 each |
| "corporate governance", Fortune ranking, stock exchange (NASDAQ/NYSE) | −10 each |
| "IPO", "ESG", "corporate social responsibility", "newsroom" | −5 each |
| "shareholder(s)", "global presence", C-suite titles, "diversity & inclusion" | −5 each |
| "M&A" / "acquisition of" / "merged with" | −5 each |
| "worldwide", "enterprise", "countries" reference, "compliance", "regulatory" | −2 each |
| "careers", "press release", employee count mentioned | −2 each |
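The keyword scan over text with scripts and styles removed can be sketched with the standard library. The regular expressions below cover only a handful of the patterns in the two tables and are illustrative guesses at how they might be written. The sketch assumes each pattern scores at most once per page.

```python
import re
from html.parser import HTMLParser

# A handful of patterns from the two tables above: (regex, points).
KEYWORD_PATTERNS = [
    (re.compile(r"\bfamily[- ](owned|run)\b", re.I), +10),
    (re.compile(r"\bfree (estimate|quote)\b", re.I), +5),
    (re.compile(r"\b(since|established) (19|20)\d{2}\b", re.I), +2),
    (re.compile(r"\binvestor relations\b", re.I), -10),
    (re.compile(r"\b(NASDAQ|NYSE)\b"), -10),
    (re.compile(r"\bpress release\b", re.I), -2),
]


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_startendtag(self, tag, attrs):
        pass  # self-closed tags carry no text

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def keyword_hits(html):
    """Return (matched text, points) for each pattern that appears at least once."""
    parser = _VisibleText()
    parser.feed(html)
    text = " ".join(" ".join(parser.parts).split())  # collapse whitespace
    hits = []
    for pattern, points in KEYWORD_PATTERNS:
        match = pattern.search(text)
        if match:
            hits.append((match.group(0), points))
    return hits
```

In this sketch, a phrase that appears only inside a script block, such as a JSON payload mentioning "investor relations", does not score.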
Navigation link patterns (14 patterns)
| Link path | Points |
| /investor-relations | −10 |
| /governance, /annual-report, /global-offices | −10 each |
| /newsroom, /press-releases, /sustainability, /media-center | −5 each |
| /careers | −2 |
| /free-quote, /free-estimate, /service-area | +5 each |
| /our-story, /testimonials | +2 each |
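The link path table translates directly into a dictionary lookup. In this sketch the function name is hypothetical, and matching on the first path segment (so that /careers/engineering counts as /careers) is an assumption; the source does not say whether paths are matched exactly or as prefixes.

```python
from urllib.parse import urlparse

# Path-to-points table copied from the rows above.
LINK_POINTS = {
    "/investor-relations": -10, "/governance": -10, "/annual-report": -10,
    "/global-offices": -10, "/newsroom": -5, "/press-releases": -5,
    "/sustainability": -5, "/media-center": -5, "/careers": -2,
    "/free-quote": +5, "/free-estimate": +5, "/service-area": +5,
    "/our-story": +2, "/testimonials": +2,
}


def link_signals(hrefs):
    """Score each distinct known path once, matching on the first path segment."""
    seen = {}
    for href in hrefs:
        path = urlparse(href).path.lower().rstrip("/")
        first_segment = "/" + path.lstrip("/").split("/", 1)[0]
        if first_segment in LINK_POINTS:
            seen[first_segment] = LINK_POINTS[first_segment]
    return seen
```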
About page NLP (26 SMB + 17 enterprise patterns)
About pages are scraped separately because they contain the richest self-description. Additional SMB patterns run on About pages only:
| Pattern | Points |
| "started in my garage/kitchen/basement" | +10 |
| "my husband/wife/partner and I", "passion project", "black-owned" | +10 each |
| First-person founder narrative ("I started…") | +5 |
| Brief About page (<100 words) | +2 |
| Extensive About page (>1,000 words) | −2 |
| "100+ countries", thousands of employees on About page | −10 each |
Address & phone signals (5 signals)
| Signal | Points | Logic |
| Single physical address | +5 | Exactly one street address found — single-location business |
| Multiple physical addresses (3+) | −5 | Three or more street addresses indicate multi-location operation |
| Store locator / "find a location" | −10 | Store locator UI implies enterprise-scale retail presence |
| Local phone number only (no toll-free) | +2 | Single local area code with no 800/888/877/866 number |
| Toll-free phone number | −2 | 800/888/877/866 numbers common for larger national businesses |
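The two phone rows can be sketched as a single check. The regular expression is a hypothetical pattern for North American numbers only, and the toll-free prefixes are the four listed in the table. When both a toll-free and a local number appear, only the toll-free signal fires, following the "no toll-free" condition on the local row.

```python
import re

# North American numbers only; prefixes are the four listed in the table.
TOLL_FREE_PREFIXES = {"800", "888", "877", "866"}
PHONE = re.compile(r"\(?\b(\d{3})\)?[\s.\-]\d{3}[\s.\-]\d{4}\b")


def phone_signal(text):
    """Return ('toll-free', -2), ('local-only', +2), or None."""
    area_codes = {m.group(1) for m in PHONE.finditer(text)}
    if not area_codes:
        return None
    if area_codes & TOLL_FREE_PREFIXES:
        return ("toll-free", -2)
    if len(area_codes) == 1:
        return ("local-only", +2)
    return None  # several local area codes: neither row applies
```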
Social & legal signals (~6 signals)
| Signal | Points | Logic |
| Yelp business link | +5 | yelp.com/biz/ link on page — strongly associated with local SMBs |
| Nextdoor link | +5 | Nextdoor neighborhood platform used by local businesses |
| Minimal social presence (1–2 platforms) | +2 | Few social links suggest small marketing team |
| Extensive social presence (5+ platforms) | −2 | Broad multi-channel presence suggests dedicated marketing team |
| Corporate entity in copyright | −5 | Copyright with Inc., Corp., Group, Holdings, PLC, AG, GmbH |
| 3+ legal/compliance pages | −5 | Privacy policy + ToS + cookie policy + accessibility statement |
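The "Corporate entity in copyright" row could be checked with a pattern like the one below. This is a hypothetical sketch: the 80-character window and the accepted copyright markers are assumptions, and the suffix list is copied from the table.

```python
import re

# Suffixes listed in the "Corporate entity in copyright" row.
_SUFFIXES = r"(?:Inc|Corp|Group|Holdings|PLC|AG|GmbH)"
# Hypothetical pattern: a copyright marker followed, on the same line and
# within 80 characters, by one of the suffixes as a whole word.
COPYRIGHT_ENTITY = re.compile(
    r"(?:©|\(c\)|copyright)[^\n]{0,80}?\b" + _SUFFIXES + r"\b",
    re.IGNORECASE,
)


def has_corporate_copyright(text):
    """True when a copyright notice names a corporate entity suffix."""
    return COPYRIGHT_ENTITY.search(text) is not None
```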