Technical brief

How WP Article Cleaner integrates with your WordPress site, what it touches, what it doesn't, and how the moving parts fit together.

1. Overview

WP Article Cleaner is delivered in two layers:

  1. A Python client that talks to a self-hosted WordPress site through its REST API. In Stage 1 it is read-only and only fetches your content; applying approved changes back is planned for Stage 2.
  2. An AI auditing engine that consumes the fetched content and produces structured edit proposals. The engine is a frontier large language model wrapped in an editorial harness we maintain — its internals are intentionally left out of this document.

This brief covers the first layer end-to-end. It is also the layer you'd self-host if you wanted to run the workflow in a fully air-gapped environment.

2. Architecture

The client is intentionally a thin wrapper. There is no broker, no queue, no database. Every interaction is a direct HTTPS call to /wp-json/wp/v2/... on your site, authenticated with HTTP Basic Auth.
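As a rough illustration of how thin that wrapper can be, the sketch below assembles the URL for a REST call using only the standard library. The helper name build_url and the example.com host are hypothetical; the real client may build its URLs differently.

```python
from urllib.parse import urlencode

API_PREFIX = "/wp-json/wp/v2"  # REST namespace named in this brief


def build_url(base_url, path, params=None):
    """Join the site URL, the REST prefix, and a resource path (hypothetical helper)."""
    url = base_url.rstrip("/") + API_PREFIX + "/" + path.lstrip("/")
    if params:
        # Sort the keys so the resulting URL is deterministic.
        url += "?" + urlencode(sorted(params.items()))
    return url


print(build_url("https://example.com/", "posts", {"per_page": 100, "context": "edit"}))
```

Every call the client makes reduces to a URL of this shape plus an Authorization header, which is why no broker or queue is needed.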

┌──────────────────────┐        HTTPS         ┌────────────────────────┐
│  AI auditing engine  │ ───────────────────▶ │  WordPress REST API    │
│  (proprietary)       │   /wp-json/wp/v2/... │  posts · pages · meta  │
└──────────┬───────────┘ ◀─────────────────── └────────────────────────┘
           │ uses                                       ▲
           ▼                                            │ Basic Auth
┌──────────────────────┐                                │ (App Password)
│  wp_cleaner Python   │ ───────────────────────────────┘
│  client library      │
└──────────────────────┘

The Python package wp_cleaner is what the AI engine calls into. It's also what you can call directly from your own scripts if you want to build dashboards, exports, or audits on top of the same primitives.

3. WordPress integration

The client targets the standard WordPress REST API surface. No plugin, no mu-plugin, no theme modifications, and no direct database access are required.

Endpoint                 Purpose                              Used by
GET /wp/v2/posts         Paginated post listing               list_posts()
GET /wp/v2/posts/{id}    Single post with raw block markup    get_post()

Both calls are issued with the context=edit query parameter. That's important: with context=view, the REST API returns post content as rendered HTML — Gutenberg block delimiters (<!-- wp:paragraph -->) have already been stripped. With context=edit we get content.raw, the source of truth, which is what any sane editing pass needs to operate on.
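To make the difference concrete, here is a small sketch of a guard that refuses to edit a post fetched without context=edit. The function name editable_source is hypothetical, and the two payloads are hand-written and heavily trimmed rather than captured from a real site.

```python
def editable_source(post):
    """Return the raw block markup, or raise if the post lacks content.raw."""
    content = post.get("content", {})
    if "raw" not in content:
        raise ValueError("content.raw missing; request the post with context=edit")
    return content["raw"]


# Hand-written, trimmed payloads for illustration only.
view_post = {"id": 1, "content": {"rendered": "<p>Hello</p>\n"}}
edit_post = {
    "id": 1,
    "content": {
        "raw": "<!-- wp:paragraph -->\n<p>Hello</p>\n<!-- /wp:paragraph -->",
        "rendered": "<p>Hello</p>\n",
    },
}

print(editable_source(edit_post))
```

Failing loudly when content.raw is absent is one way to keep an editing pass from silently working on rendered HTML.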

Pagination

Listing endpoints in WordPress are page-based. The client walks them transparently using the response header X-WP-TotalPages:

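def paginated_get(self, path, params=None):
    """Yield every item from a paginated listing endpoint, one page at a time."""
    params = dict(params or {})  # copy so the caller's dict is not mutated
    params.setdefault("per_page", 100)  # 100 is the largest page size WordPress allows
    page = 1
    while True:
        params["page"] = page
        resp = self._request(path, params=params)
        for item in resp.json():
            yield item
        # WordPress reports the total page count in a response header.
        total = int(resp.headers.get("X-WP-TotalPages", "1"))
        if page >= total:
            return
        page += 1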
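One way to convince yourself that the loop stops at the right page is to run it against a stand-in transport. The sketch below copies the loop into a hypothetical FakeClient that serves pre-baked pages; FakeClient and FakeResponse are test doubles invented for this example, not part of wp_cleaner.

```python
class FakeResponse:
    """Stand-in for an HTTP response: a JSON body plus headers."""

    def __init__(self, items, total_pages):
        self._items = items
        self.headers = {"X-WP-TotalPages": str(total_pages)}

    def json(self):
        return self._items


class FakeClient:
    """Serves pre-baked pages instead of calling WordPress (hypothetical test double)."""

    def __init__(self, pages):
        self.pages = pages
        self.requested = []  # page numbers asked for, in order

    def _request(self, path, params=None):
        page = params["page"]
        self.requested.append(page)
        return FakeResponse(self.pages[page - 1], len(self.pages))

    def paginated_get(self, path, params=None):
        # Same loop as the client method shown above.
        params = dict(params or {})
        params.setdefault("per_page", 100)
        page = 1
        while True:
            params["page"] = page
            resp = self._request(path, params=params)
            for item in resp.json():
                yield item
            total = int(resp.headers.get("X-WP-TotalPages", "1"))
            if page >= total:
                return
            page += 1


client = FakeClient([[{"id": 1}, {"id": 2}], [{"id": 3}]])
ids = [post["id"] for post in client.paginated_get("posts")]
print(ids, client.requested)  # [1, 2, 3] [1, 2]
```

Because paginated_get is a generator, each page is requested only when the caller has consumed the previous one.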

4. Authentication

We use WordPress Application Passwords, a feature built into WordPress 5.6 and later. No third-party plugin is installed.

Application Passwords are per-application credentials that a user generates from Users → Profile → Application Passwords. Each one can be revoked individually without changing the account's main password.

Why not OAuth or JWT?

Both options exist, but they require either a server-side OAuth broker or installing a JWT plugin. Application Passwords are native, revocable, and have a smaller attack surface. We default to them and only revisit if a customer explicitly requires SSO.
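For reference, this is roughly what HTTP Basic Auth with an Application Password amounts to on the wire. The helper name basic_auth_header is hypothetical and the credentials are placeholders; HTTP libraries normally build this header for you.

```python
import base64


def basic_auth_header(username, app_password):
    """Return the Authorization header value for HTTP Basic Auth."""
    token = f"{username}:{app_password}".encode("utf-8")
    return "Basic " + base64.b64encode(token).decode("ascii")


# Placeholder credentials, not a real Application Password.
header = basic_auth_header("editor", "abcd1234")
print(header)
```

Base64 is an encoding, not encryption, so the header is only as private as the connection that carries it. That is why the client insists on an https:// base URL.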

5. Environment configuration

The client reads three environment variables from the host process. There is no .env file support and no credential file is ever written to disk by the tool.

Variable           Required    Notes
WP_BASE_URL        Yes         Must start with https://. Trailing slashes are stripped.
WP_USERNAME        Yes         WordPress login of the user that owns the App Password.
WP_APP_PASSWORD    Yes         Spaces in the displayed password are stripped automatically.

Set them in the OS environment of the machine running the workflow. On Linux/macOS, that typically means appending to ~/.bashrc or ~/.zshrc; on Windows, using [Environment]::SetEnvironmentVariable(...) at the User scope.
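The validation rules in the table could be implemented along the following lines. This is a sketch under stated assumptions, not the shipped code: the Config dataclass, the optional env parameter (added here so the example runs without touching the real environment), and the error wording are all illustrative.

```python
import os
from dataclasses import dataclass


class ConfigError(Exception):
    """Raised when the environment is incomplete or unsafe."""


@dataclass(frozen=True)
class Config:
    base_url: str
    username: str
    app_password: str


def load_config(env=None):
    """Build a Config from environment variables (env defaults to os.environ)."""
    env = os.environ if env is None else env
    names = ("WP_BASE_URL", "WP_USERNAME", "WP_APP_PASSWORD")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise ConfigError("missing environment variables: " + ", ".join(missing))
    base_url = env["WP_BASE_URL"].rstrip("/")
    if not base_url.startswith("https://"):
        raise ConfigError("WP_BASE_URL must start with https://")
    # WordPress displays App Passwords in space-separated groups; drop the spaces.
    app_password = env["WP_APP_PASSWORD"].replace(" ", "")
    return Config(base_url, env["WP_USERNAME"], app_password)


cfg = load_config({
    "WP_BASE_URL": "https://example.com/",
    "WP_USERNAME": "editor",
    "WP_APP_PASSWORD": "abcd EFGH 1234 ijkl MNOP 5678",
})
print(cfg.base_url, cfg.app_password)  # https://example.com abcdEFGH1234ijklMNOP5678
```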

6. The client library

The Python package exposes a small, deliberate surface:

from wp_cleaner import WordPressClient, load_config, list_posts, get_post

cfg = load_config()
client = WordPressClient(cfg.base_url, cfg.username, cfg.app_password)

# Walk every published article (paginated under the hood):
for post in list_posts(client, status="publish"):
    print(post["id"], post["title"]["rendered"])

# Fetch a single post — content.raw preserves Gutenberg block markup:
post = get_post(client, 123)
print(post["content"]["raw"])

That's the entire public API for Stage 1. Anything more — bulk export, content diffing, write-back — composes from those primitives.
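As one example of composing on top of those primitives, a bulk export can be a short function over any iterable of post dictionaries. The name export_jsonl and the choice of fields are hypothetical; the sample list stands in for what list_posts(client, status="publish") would yield.

```python
import io
import json


def export_jsonl(posts, out):
    """Write one JSON object per post to a text stream; return the count written."""
    count = 0
    for post in posts:
        record = {
            "id": post["id"],
            "title": post["title"]["rendered"],
            "raw": post["content"].get("raw"),  # None unless fetched with context=edit
        }
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# Hand-written sample standing in for list_posts(client, status="publish").
sample = [{"id": 7, "title": {"rendered": "Hello"}, "content": {"raw": "<!-- wp:paragraph -->"}}]
buffer = io.StringIO()
written = export_jsonl(sample, buffer)
print(written)  # 1
```

In a real script, out would be an open file and posts would be the generator returned by list_posts, so the export streams page by page instead of holding the whole site in memory.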

7. Error handling

WordPress returns errors as JSON objects of the form {"code": "...", "message": "..."}. The client parses those and raises a typed WPAPIError with the original status code, error code, and human-readable message preserved:

try:
    post = get_post(client, 9999)
except WPAPIError as exc:
    print(exc.status)    # 404
    print(exc.code)      # 'rest_post_invalid_id'
    print(exc.message)   # 'Invalid post ID.'

Configuration errors (missing env vars, non-HTTPS URLs) raise ConfigError with a message that points at the README. Both are intentionally distinct exception types so callers can decide which to surface and which to treat as fatal.
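A minimal sketch of how such an error could be built from a response, assuming the exception carries the three attributes shown above. The helper error_from_response and the "unknown_error" fallback code are hypothetical, added here to cover bodies that are not WordPress JSON.

```python
import json


class WPAPIError(Exception):
    """Typed error carrying the HTTP status plus WordPress's code and message."""

    def __init__(self, status, code, message):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message


def error_from_response(status, body_text):
    """Build a WPAPIError from an HTTP status and response body (hypothetical helper)."""
    try:
        payload = json.loads(body_text)
    except ValueError:
        payload = None
    if not isinstance(payload, dict):
        # Not a WordPress JSON error, for example an HTML page from a proxy.
        return WPAPIError(status, "unknown_error", body_text[:200])
    return WPAPIError(status, payload.get("code", "unknown_error"), payload.get("message", ""))


err = error_from_response(404, '{"code": "rest_post_invalid_id", "message": "Invalid post ID."}')
print(err.status, err.code, err.message)  # 404 rest_post_invalid_id Invalid post ID.
```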

8. Security model

9. Roadmap

Stage 2 introduces the write path and the safety scaffolding that comes with it.
