Playbook: Build a Shared Toolbox for Your Organization
Every company with internal systems has the same quiet problem right now. A finance analyst opens an AI client, pastes in some API docs, and gets code that pulls invoices out of NetSuite. Two days later somebody in ops does the same thing with a slightly different prompt and ends up with a subtly different query. A week later someone on the support team writes their own version to look up customer credit. Six months in, eight different people have eight different bespoke scripts querying the same three systems, each one with a credential hardcoded somewhere nobody wants to admit, each one breaking on a different day when the underlying API changes.
The individual AI experience is great. The organizational outcome is a mess.
This playbook walks you through the alternative: a curated, governed toolbox that your whole company draws from. You build the tools once. They are tested, versioned, and observable. Every subagent anyone builds pulls from the same shelf. Credentials live in one place. When the API changes, one person updates one tool and everyone downstream is fixed at once.
This is not a use-case playbook. The other playbooks in this section answer "what can I build?" This one answers "how do I set up my workspace so my team can build safely?"
Who this playbook is for
- A platform engineer, internal-tools champion, or ops architect at a company with real internal systems — a WMS, an ERP, a DSP, a homegrown order system, a ticketing platform.
- Somebody whose job is to make the rest of the company productive, not to personally answer every "can you look up X for me" question.
- Probably a small group of one to three people who will own the toolbox long-term.
What you will build
By the end of this playbook, you will have:
- A first batch of five to eight tools that cover the most common read questions your team asks about one specific internal system.
- Workspace-level credentials wired in once, not pasted into anyone's code.
- A published current version of each tool that subagents can draw from immediately.
- A sample subagent using the toolbox end-to-end, so teammates have a reference to copy.
- A simple governance pattern — who owns the tools, how versions roll out, how to deprecate.
- An answer to the question "how do we prevent people from hand-rolling their own version of this?" — because they no longer have to.
Why a toolbox beats bespoke code
Before we get into the how, it helps to be concrete about what the toolbox actually buys you over the alternative.
One credential path, not many. A tool declares the secrets and OAuth tokens it needs. The workspace injects those as environment variables at run time. Tool authors never see the value, and nobody pastes it into a chat, a file, or a .env on their laptop.
A schema at the door. Every tool declares the shape of its inputs. Callers cannot pass arbitrary text — parameters are validated before the tool runs. This is the single cheapest way to stop hallucinated calls.
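To make the idea concrete, here is a minimal sketch of what validation at the door buys you. The real validation is handled by the platform before your handler runs; `validate` here is a stand-in, not a platform API.

```typescript
// A call with the wrong shape never reaches the handler.
type Schema = { required: string[] };

// Returns the list of missing required parameters.
function validate(schema: Schema, input: Record<string, unknown>): string[] {
  return schema.required.filter((key) => !(key in input));
}

const schema = { required: ["order_id"] };
validate(schema, { order_id: "SO-123" }); // → []  (call proceeds)
validate(schema, { orderId: "SO-123" });  // → ["order_id"]  (call rejected)
```

The second call is exactly the kind of hallucinated parameter name an agent produces; with a declared schema it fails loudly at the boundary instead of silently returning garbage.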
A secure runtime. Tools execute in a sandboxed environment. The filesystem is restricted. Network access is explicitly whitelisted per tool (only the domains the tool needs). There is no way for a tool to read the machine it runs on, reach into another tool's state, or call out to a domain you did not allow.
Immutable versions. A tool has drafts, committed snapshots, and one current version. Draft changes do not affect anything until you publish. The "published" version is what every subagent actually calls. When you fix a bug or change behavior, you publish a new version, and every caller picks it up on the next run — without anyone having to update their own code.
Complete observability. Every tool run — who invoked it, what the input was, what it returned, how long it took — is logged and viewable in one place. You get the audit trail without having to build it.
None of this is novel in platform engineering. What is novel is that any of your teammates can now build an AI agent that uses those tools. The toolbox makes the "can we let AI touch our internal systems" question answerable: yes, through these specific tools, with these specific permissions, logged here.
Step 1: Pick the first system and the first eight questions
Pick one internal system. Not two. One. The right candidate is whichever system is causing the most "can you look up X for me" interruptions in your team today. A warehouse management system, an order management system, or a CRM is usually the right first target. Accounting systems are a good second target. Save anything that has a write surface for later — start read-only.
Now look at the last two weeks of messages in whichever Slack channel that team lives in and list the top questions. Group them. Aim for five to eight distinct questions, each one a single API call away from an answer:
- "Where is order X?"
- "Do we have SKU X in stock at warehouse Y?"
- "What's the credit limit on account Z?"
- "When did order X ship, and what's the tracking number?"
- "Which orders did we ship yesterday to zone Z?"
Each one of these becomes one tool. Keep tools narrow — one tool per endpoint, not one tool that does six things. Narrow tools are easier to test, easier to reason about, and easier to deprecate when the underlying API changes.
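Written down as a plan, the batch might look like the sketch below. The tool names and endpoints are hypothetical; use your system's own vocabulary, but keep the one-tool-per-endpoint shape.

```typescript
// One narrow tool per question, all against the same single system.
const firstBatch = [
  { name: "wms_get_order", endpoint: "GET /orders/{id}" },
  { name: "wms_check_stock", endpoint: "GET /inventory/{sku}" },
  { name: "wms_get_credit_limit", endpoint: "GET /accounts/{id}/credit" },
  { name: "wms_get_shipment", endpoint: "GET /orders/{id}/shipment" },
  { name: "wms_list_shipped_orders", endpoint: "GET /shipments?date={d}&zone={z}" },
];
```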
Step 2: Wire the credential once
Go to the connections area of your workspace. If the system uses OAuth (Slack, Google, Salesforce), set up the OAuth connection. If it uses an API key or bearer token, add it as a workspace secret. Whichever path, you only do this once, and only you (or whoever is managing the toolbox) ever sees the actual value.
Give the credential the narrowest possible scope. If your WMS supports creating a read-only API user, create one specifically for Assist. If you can only mint a token that does everything, add it anyway, but grant it only to the specific tools that need it. The point of scoping at the workspace level is that the tools inherit exactly what they are allowed to touch — no more, no less.
Step 3: Build the first tool
Open /tools/new. Pick a runtime (TypeScript is the fastest path; other runtimes are available). You are going to fill in four things:
A name and description. Make the description specific. "Look up an order by ID in the warehouse management system. Returns status, line items, ship date, tracking number." This description is what the AI reads when it decides whether this tool is the right one for a question. A vague description is how useful tools get skipped in favor of hallucinated code.
The handler. The contract is short: you write a function that receives an event and a context. The event.data object is the validated input — whatever parameters the caller passed. The event.env object contains the secrets and tokens the tool is allowed to see. You return a JSON-serializable object. Whatever you return is what the caller gets back.
In practice your tool is a handful of lines: build a URL from event.data, add an auth header from event.env, call fetch, shape the response, return it.
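A sketch of what that handful of lines looks like for the order-lookup tool. The base URL, env var name, and response fields are assumptions; substitute your WMS's real ones. The shaping step is pulled out as a pure function so it is trivially testable on its own.

```typescript
type ToolEvent = {
  data: { order_id: string; warehouse_id?: string }; // validated input
  env: { WMS_API_KEY: string };                      // injected secret
};

// Pure shaping step: return a small, stable shape, not the raw payload.
export function shapeOrder(raw: Record<string, unknown>) {
  return {
    order_id: raw.id ?? null,
    status: raw.status ?? null,
    ship_date: raw.shipDate ?? null,
    tracking_number: raw.trackingNumber ?? null,
  };
}

export async function handler(event: ToolEvent) {
  const url = `https://wms.example.com/api/orders/${encodeURIComponent(event.data.order_id)}`;
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${event.env.WMS_API_KEY}` },
  });
  if (!res.ok) throw new Error(`WMS returned ${res.status}`);
  return shapeOrder(await res.json());
}
```

Returning a shaped object rather than the raw payload keeps the tool's contract stable even when the upstream API adds or renames fields.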
The input schema. Declare the parameter shape. order_id is a required string. warehouse_id is an optional string. A schema editor is built into the page — you do not need to write JSON by hand unless you want to. The schema is what stops a caller from passing the wrong thing.
The permissions. Declare which environment variables the tool needs (pointing at the credential you wired in step 2) and which network domains it is allowed to reach. This is the guardrail: a tool cannot make calls you did not authorize.
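Spelled out as plain objects, the schema and permissions for the order-lookup tool might look like this. In practice the built-in editors produce these declarations for you, and the exact format on your platform may differ; the names here are illustrative.

```typescript
// JSON-Schema-style input declaration: one required string, one optional.
const inputSchema = {
  type: "object",
  properties: {
    order_id: { type: "string", description: "WMS order ID, e.g. SO-12345" },
    warehouse_id: { type: "string", description: "Optional warehouse filter" },
  },
  required: ["order_id"],
  additionalProperties: false,
};

// The guardrail: one secret, one reachable host, nothing else.
const permissions = {
  env: ["WMS_API_KEY"],          // the workspace secret from step 2
  domains: ["wms.example.com"],  // the only host this tool may call
};
```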
Save a draft.
Step 4: Test from the edit page
On the tool's edit page there is a test panel. Paste a real input. Click run. You see the return value, the logs, the duration, and any error. You are not calling from a subagent yet — the tool stands on its own, which means you can develop it fully before any agent ever sees it.
Iterate until the tool is right. Common things you will catch:
- The API returns a slightly different shape than the docs suggest.
- A field you assumed was a number is actually a string.
- The credential has less scope than you expected.
Fix each one in the draft, re-test, move on.
Step 5: Commit and publish
The versioning model rewards deliberateness. When the draft is ready:
- Commit the draft. The committed version is immutable — a permanent snapshot you can always roll back to.
- Publish the committed version. Publishing sets it as the current version, which is what subagents actually call.
Now the tool is live. Anyone in the workspace can add it to a subagent.
When you need to change the tool later, you edit a new draft, test, commit, publish. Drafts do not affect anything running in production. A teammate's subagent is never reading a half-finished version of a tool — it is always pinned to whichever version was current at the moment of the call.
Step 6: Repeat for the rest of the first batch
Do the remaining four to seven tools the same way. The second one takes half as long as the first. By the fifth you have a template. A batch of eight tools covering one system is a weekend of work, not a quarter.
Resist the urge to add tools that cross systems. Each tool should hit one endpoint, on one system. Composition happens at the subagent level, not at the tool level.
Step 7: Build a reference subagent
Create one subagent yourself that uses the new tools end-to-end. The ops channel concierge playbook is a good template — a Slack-routed agent that lives in the channel where your team already asks these questions.
Set the agent's tool whitelist to exactly the tools you built in this batch. Nothing else. That is the guardrail: this agent can only call these tools, which can only touch these domains, which can only use these credentials. The whole path is inspectable.
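As a sketch, the reference agent's configuration reduces to a short whitelist. The shape and names below are hypothetical; the point is that the list contains exactly the batch-one tools and nothing else.

```typescript
// The whole inspectable path starts here: this agent can call only
// these tools, which can reach only their declared domains.
const wmsConcierge = {
  name: "wms-concierge",
  channel: "#ops-wms",
  tools: [
    "wms_get_order",
    "wms_check_stock",
    "wms_get_shipment",
    "wms_list_shipped_orders",
  ],
};
```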
Run it for a week. Watch the tool-history page. Look for:
- Tools that get called a lot. These are working.
- Tools that never get called. Either the agent's prompt isn't describing them well, or the question does not actually come up — in which case delete the tool, because an unused tool is a maintenance liability.
- Tools that fail. Either the API changed, the credential expired, or the input schema is too loose. Fix and publish a new version.
Step 8: Open it up to the team
Now announce the toolbox. Tell the team:
We now have a curated set of tools for querying the WMS. They are in the workspace under the 'WMS' tool pack. When you build a subagent that needs to answer WMS questions, pick from these tools — they are tested, they handle credentials correctly, and every run is logged. Please don't hand-roll your own. If you need a WMS tool that doesn't exist yet, ping me — I'd rather build one and have everyone benefit than see five versions of the same query.
Point them at the reference subagent you built. Point them at adding tools to a subagent as the how-to. The shelf is open.
Step 9: Govern the toolbox
Three lightweight practices that pay off fast:
Ownership. One person (or a small group) owns the toolbox for each system. They are the ones who review proposed new tools, commit and publish versions, and decide when to deprecate. Without a clear owner the toolbox stagnates and the "hand-rolled bespoke code" pattern comes back.
Deprecation with a deadline. When a tool is replaced, mark the old one as deprecated, give it a hard retirement date, and audit tool-history to see which subagents still call it. Message the owners of those subagents before you pull the plug. The clean version control is what makes this easy — you can see exactly who's on the old version.
Review calls periodically. Once a quarter, skim tool-history. Look for tools that get called a lot but fail often — those need hardening. Look for subagents making a lot of tool calls with identical inputs — those queries should probably be cached or combined into a single composite tool. Look for tools nobody uses — delete them.
Step 10: Expand to the next system
Once the first system's toolbox is settled and the team is using it, pick the next internal system and do it again. A quarter later you have two governed toolboxes; a year later you have four. The marginal cost of each new subagent anyone on the team builds keeps dropping because they keep picking from a bigger, sharper shelf.
What you built
You have a workspace where:
- The tools that touch your internal systems are written once, by the right people, with the right scope.
- Credentials live in the workspace, not in someone's local environment or checked into a repo.
- Every tool invocation is schema-validated, sandboxed, and logged.
- A broken change cannot silently hit production — drafts, commits, and publishes exist precisely so you can be deliberate about rollout.
- Any teammate building a subagent picks from a trusted shelf. They do not have to write code. They do not have to handle credentials. They just pick the tools their agent needs.
What used to be eight bespoke scripts in eight different places is now one versioned tool. What used to be "we can't let ops use AI against the WMS, it's too risky" is now "ops uses the WMS toolbox, here's the log of everything they did last week."
Where to go from here
- Write tools. Once the read surface is covered, add write tools carefully — one at a time, with the tightest possible scope, and with obvious naming (erp_create_vendor, not erp_vendor_action).
- Composite tools. When you notice the same three reads always happening together, write one tool that does all three and returns the combined shape. Fewer tool calls, fewer tokens, simpler subagent prompts.
- Domain-specific specialist subagents. Once you have a full toolbox for a system, build a specialist subagent whose entire job is to answer questions about that system. See the tier-1 support triage playbook for the specialist pattern.
- Cross-team sharing. If another team builds their own toolbox in a separate workspace, consider consolidating — the toolbox compounds across teams the same way it compounds across a team.
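The composite-tool idea can be sketched in a few lines: one handler that bundles reads which always happen together, so the subagent makes one tool call instead of several. Endpoints and field names below are assumptions, and the combining step is a pure function so it can be tested without touching the network.

```typescript
// Pure combining step: merge two upstream payloads into one stable shape.
export function combineOrderView(
  order: Record<string, unknown>,
  shipment: Record<string, unknown>,
) {
  return {
    status: order.status ?? null,
    ship_date: order.shipDate ?? null,
    tracking_number: shipment.trackingNumber ?? null,
  };
}

export async function handler(event: {
  data: { order_id: string };
  env: { WMS_API_KEY: string };
}) {
  const base = "https://wms.example.com/api";
  const headers = { Authorization: `Bearer ${event.env.WMS_API_KEY}` };
  const id = encodeURIComponent(event.data.order_id);
  // Two upstream reads in parallel, one tool call for the subagent.
  const [order, shipment] = await Promise.all([
    fetch(`${base}/orders/${id}`, { headers }).then((r) => r.json()),
    fetch(`${base}/orders/${id}/shipment`, { headers }).then((r) => r.json()),
  ]);
  return combineOrderView(order, shipment);
}
```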