Technical brief
How WP Article Cleaner integrates with your WordPress site, what it touches, what it doesn't, and how the moving parts fit together.
1. Overview
WP Article Cleaner is delivered in two layers:
- A read-only Python client that talks to a self-hosted WordPress site through its REST API. This is the layer that fetches your content and, eventually, applies approved changes back.
- An AI auditing engine that consumes the fetched content and produces structured edit proposals. The engine is a frontier large language model wrapped in an editorial harness we maintain — its internals are intentionally left out of this document.
This brief covers the first layer end-to-end. It is also the layer you'd self-host if you wanted to run the workflow in a fully air-gapped environment.
2. Architecture
The client is intentionally a thin wrapper. There is no broker, no
queue, no database. Every interaction is a direct HTTPS call to
/wp-json/wp/v2/... on your site, authenticated with
HTTP Basic Auth.
```text
┌──────────────────────┐        HTTPS         ┌────────────────────────┐
│  AI auditing engine  │ ───────────────────▶ │   WordPress REST API   │
│    (proprietary)     │  /wp-json/wp/v2/...  │  posts · pages · meta  │
└──────────┬───────────┘ ◀─────────────────── └────────────────────────┘
           │ uses                                       ▲
           ▼                                            │ Basic Auth
┌──────────────────────┐                                │ (App Password)
│  wp_cleaner Python   │ ───────────────────────────────┘
│    client library    │
└──────────────────────┘
```
The Python package wp_cleaner is what the AI engine
calls into. It's also what you can call directly from your own
scripts if you want to build dashboards, exports, or audits on top
of the same primitives.
3. WordPress integration
The client targets the standard WordPress REST API surface. No
plugin, no mu-plugin, no theme modifications, and no
direct database access are required.
| Endpoint | Purpose | Used by |
|---|---|---|
| GET /wp/v2/posts | Paginated post listing | list_posts() |
| GET /wp/v2/posts/{id} | Single post with raw block markup | get_post() |
Both calls are issued with the context=edit query
parameter. That's important: with context=view, the
REST API returns post content as rendered HTML — Gutenberg
block delimiters (<!-- wp:paragraph -->) have
already been stripped. With context=edit we get
content.raw, the source of truth, which is what any
sane editing pass needs to operate on.
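To make the difference concrete, here is a minimal sketch that lists the block types found in a post's raw content. The helper name block_names and its regular expression are illustrative assumptions, not part of wp_cleaner. The pattern only recognises opening and self-closing delimiters and does not attempt full block parsing.

```python
import re

# Matches opening or self-closing block delimiters such as
# <!-- wp:paragraph --> or <!-- wp:image {"id":5} /-->.
# Closing delimiters (<!-- /wp:paragraph -->) are skipped on purpose.
_BLOCK_OPEN = re.compile(
    r"<!--\s+wp:([a-z][a-z0-9_-]*(?:/[a-z][a-z0-9_-]*)?)[\s/]"
)


def block_names(raw_content):
    """Return the block names found in raw post content, in document order."""
    return _BLOCK_OPEN.findall(raw_content)


# Hypothetical sample data: the same post as content.raw and as rendered HTML.
raw = (
    "<!-- wp:heading -->\n<h2>Title</h2>\n<!-- /wp:heading -->\n"
    "<!-- wp:paragraph -->\n<p>Hello</p>\n<!-- /wp:paragraph -->"
)
rendered = "<h2>Title</h2>\n<p>Hello</p>"

print(block_names(raw))       # block structure is visible in content.raw
print(block_names(rendered))  # rendered HTML carries no delimiters
```

An editing pass that only saw the rendered form would have no way to write the post back without flattening its block structure.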
Pagination
Listing endpoints in WordPress are page-based. The client walks
them transparently using the response header
X-WP-TotalPages:
```python
def paginated_get(self, path, params=None):
    # Copy so the caller's dict is never mutated.
    params = dict(params or {})
    params.setdefault("per_page", 100)  # 100 is the REST API maximum
    page = 1
    while True:
        params["page"] = page
        resp = self._request(path, params=params)
        for item in resp.json():
            yield item
        # WordPress reports the total page count in a response header.
        total = int(resp.headers.get("X-WP-TotalPages", "1"))
        if page >= total:
            return
        page += 1
```
4. Authentication
We use WordPress Application Passwords, a feature built into WordPress 5.6 and later. No third-party plugin is installed.
Application Passwords are scoped credentials a user can generate from Users → Profile → Application Passwords. They are:
- Per-application. You can issue one for WP Article Cleaner specifically and revoke it without touching your real password.
- HTTPS-only. The credential travels as HTTP Basic Auth over TLS. The client refuses non-HTTPS base URLs.
- Visible exactly once. WordPress shows the generated string a single time; it is never retrievable afterwards.
OAuth and JWT authentication are both possible alternatives, but they require either a server-side OAuth broker or installing a JWT plugin. Application Passwords are native, revocable, and have a smaller attack surface. We default to them and only revisit if a customer explicitly requires SSO.
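For readers unfamiliar with how an Application Password travels on the wire, the sketch below builds the HTTP Basic Auth header by hand. The function name basic_auth_header and the sample credentials are hypothetical. In practice an HTTP library usually does this for you, and this is not necessarily how wp_cleaner implements it.

```python
import base64


def basic_auth_header(username, app_password):
    """Build the Authorization header value for HTTP Basic Auth."""
    # WordPress displays Application Passwords in space-separated groups;
    # the spaces are removed here before encoding.
    secret = app_password.replace(" ", "")
    token = base64.b64encode(f"{username}:{secret}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


# Hypothetical credentials, for illustration only.
headers = {"Authorization": basic_auth_header("editor", "abcd EFGH 1234 ijkl")}
```

Base64 is an encoding, not encryption, which is why the credential must only ever be sent over TLS.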
5. Environment configuration
The client reads three environment variables from the host
process. There is no .env file support and no
credential file is ever written to disk by the tool.
| Variable | Required | Notes |
|---|---|---|
| WP_BASE_URL | Yes | Must start with https://. Trailing slashes are stripped. |
| WP_USERNAME | Yes | WordPress login of the user that owns the App Password. |
| WP_APP_PASSWORD | Yes | Spaces in the displayed password are stripped automatically. |
Set them in the OS environment of the machine running the
workflow. On Linux/macOS, that typically means appending to
~/.bashrc or ~/.zshrc; on Windows, using
[Environment]::SetEnvironmentVariable(...) at the
User scope.
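The validation rules in the table can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the shipped load_config: the Config dataclass, the optional environ parameter, and the error messages are hypothetical. Only the variable names, the attribute names, and ConfigError come from this brief.

```python
import os
from dataclasses import dataclass

REQUIRED = ("WP_BASE_URL", "WP_USERNAME", "WP_APP_PASSWORD")


class ConfigError(Exception):
    """Raised when required configuration is missing or invalid."""


@dataclass(frozen=True)
class Config:
    base_url: str
    username: str
    app_password: str


def load_config(environ=None):
    """Read and validate the three WP_* variables (illustrative sketch)."""
    # The environ parameter is an assumption that makes the sketch testable.
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ConfigError("Missing environment variables: " + ", ".join(missing))

    base_url = env["WP_BASE_URL"].strip().rstrip("/")
    if not base_url.lower().startswith("https://"):
        raise ConfigError("WP_BASE_URL must start with https://")

    # Spaces in the displayed Application Password are removed.
    app_password = env["WP_APP_PASSWORD"].replace(" ", "")
    return Config(base_url, env["WP_USERNAME"], app_password)


cfg = load_config({
    "WP_BASE_URL": "https://example.com/",
    "WP_USERNAME": "editor",
    "WP_APP_PASSWORD": "abcd EFGH 1234",
})
print(cfg.base_url)  # trailing slash removed
```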
6. The client library
The Python package exposes a small, deliberate surface:
```python
from wp_cleaner import WordPressClient, load_config, list_posts, get_post

cfg = load_config()
client = WordPressClient(cfg.base_url, cfg.username, cfg.app_password)

# Walk every published article (paginated under the hood):
for post in list_posts(client, status="publish"):
    print(post["id"], post["title"]["rendered"])

# Fetch a single post; content.raw preserves Gutenberg block markup:
post = get_post(client, 123)
print(post["content"]["raw"])
```
That's the entire public API for Stage 1. Anything more — bulk export, content diffing, write-back — composes from those primitives.
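As one example of composing from those primitives, a bulk export could look roughly like the sketch below. The helper to_jsonl_lines and the sample post are hypothetical. It accepts any iterable of post dictionaries shaped like the ones shown above, so it can be tried without a live site.

```python
import json


def to_jsonl_lines(posts):
    """Yield one JSON document per post, suitable for a .jsonl export."""
    for post in posts:
        record = {
            "id": post["id"],
            "title": post["title"]["rendered"],
            "raw": post["content"]["raw"],
        }
        yield json.dumps(record, ensure_ascii=False)


# Hypothetical stand-in for what list_posts(client, status="publish") yields.
sample_posts = [
    {
        "id": 1,
        "title": {"rendered": "Hello"},
        "content": {"raw": "<!-- wp:paragraph -->"},
    },
]

for line in to_jsonl_lines(sample_posts):
    print(line)
```

Against a real site, the same function would be fed list_posts(client, status="publish") and its lines written to a file.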
7. Error handling
WordPress returns errors as JSON objects of the form
{"code": "...", "message": "..."}. The client parses
those and raises a typed WPAPIError with the original
status code, error code, and human-readable message preserved:
```python
try:
    post = get_post(client, 9999)
except WPAPIError as exc:
    print(exc.status)   # 404
    print(exc.code)     # 'rest_post_invalid_id'
    print(exc.message)  # 'Invalid post ID.'
```
Configuration errors (missing env vars, non-HTTPS URLs) raise
ConfigError with a message that points at the README.
Both are intentionally distinct exception types so callers can
decide which to surface and which to treat as fatal.
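The mapping from a WordPress error payload to a typed exception might look like the sketch below. The class here is a stand-in written for illustration, and error_from_response is a hypothetical helper. Only the exception name and its status, code, and message attributes come from this brief.

```python
class WPAPIError(Exception):
    """Illustrative stand-in for the client's typed API error."""

    def __init__(self, status, code, message):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message


def error_from_response(status, body):
    """Build a WPAPIError from an HTTP status and a decoded response body."""
    if isinstance(body, dict):
        code = body.get("code", "unknown_error")
        message = body.get("message", "")
    else:
        # Unexpected payloads, e.g. an HTML error page from a proxy.
        code, message = "unknown_error", str(body)
    return WPAPIError(status, code, message)


err = error_from_response(
    404, {"code": "rest_post_invalid_id", "message": "Invalid post ID."}
)
print(err.status, err.code, err.message)
```

Keeping the WordPress error code on the exception lets callers branch on a stable identifier such as rest_post_invalid_id instead of matching on message text.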
8. Security model
- HTTPS-only transport. The client refuses to start if WP_BASE_URL is not HTTPS. There is no way to disable this check.
- No credential persistence by the tool. Credentials are only ever read from os.environ at runtime. Nothing is cached, logged, or written to disk.
- Read-only at Stage 1. The currently shipped client cannot create, update, or delete WordPress content. Even compromise of the credential cannot mutate your site through this codebase.
- Scoped credential. Application Passwords inherit the role of the WordPress user that issued them. We recommend issuing them under an Editor-role user, not under an Administrator account.
- Auditable surface. The entire client is open source and short enough to read in one sitting. Every HTTP call the AI engine makes flows through this code.
9. Roadmap
Stage 2 introduces the write path and the safety scaffolding that comes with it:
- Update support via POST /wp/v2/posts/{id} with diff preview and dry-run modes.
- Local snapshot store. Every article gets a JSON snapshot written before it is mutated, enabling exact rollback.
- Page-builder detection. Posts authored in Elementor, Divi, or similar visual builders will be detected and refused — their content lives in custom meta, not in the post body, so REST edits would be invisible.
- Draft-only mode. Approved updates can land as drafts so a human reviews the rendered article inside WordPress before publishing.
- Pages, taxonomies, and media. The same client surface, extended to the rest of the WordPress content model.
Questions about the integration, security, or self-hosting? Get in touch.