How to Deploy a Large Language Model from Hugging Face to AWS: Part 3 — API Gateway Setup
Exposing Your Lambda Function as a Public HTTP Endpoint
This is the final part of the series. In Part 1 we deployed a Hugging Face LLM to SageMaker, and in Part 2 we wired a Lambda function to it. Here we’ll put an HTTP API in front of that Lambda using API Gateway, giving the outside world a clean endpoint to send prompts to.
Step 1: Create an HTTP API
Go to API Gateway → APIs → Create API. When prompted to choose an API type, select HTTP API and click Build.
HTTP API is the right choice here: it offers lower latency and a lower per-request price than a REST API, and its feature set is sufficient for a single-route inference endpoint.
Step 2: Add the Lambda Integration
On the integrations screen, configure the following:
| Setting | Value |
|---|---|
| Integration type | Lambda function |
| Lambda function | generate-text-lamini |
| Add permissions to Lambda function | ✓ Checked |
Checking Add permissions to Lambda function lets API Gateway automatically configure the resource-based policy needed to invoke your Lambda — one less manual step. Click Next.
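If you ever script this setup instead of using the console, nothing ticks that checkbox for you, and the permission has to be added explicitly. The sketch below shows one way to do it with boto3's Lambda `add_permission` call. The statement ID, the account ID you pass in, and the helper names are illustrative assumptions, and the API ID is only known once the API exists, so this would run after creation.

```python
def invoke_source_arn(region: str, account_id: str, api_id: str, path: str = "/generate") -> str:
    """Build an execute-api ARN pattern covering any stage and method on one route path."""
    return f"arn:aws:execute-api:{region}:{account_id}:{api_id}/*/*{path}"


def allow_api_gateway(function_name: str, source_arn: str, region: str = "us-east-2") -> None:
    """Add a resource-based policy statement letting API Gateway invoke the function."""
    import boto3  # imported lazily so the helper above works without boto3 installed

    client = boto3.client("lambda", region_name=region)
    client.add_permission(
        FunctionName=function_name,
        StatementId="apigateway-invoke-generate",  # hypothetical ID; must be unique per function
        Action="lambda:InvokeFunction",
        Principal="apigateway.amazonaws.com",
        SourceArn=source_arn,
    )
```

Scoping `SourceArn` to a single API and path keeps the grant narrow, so other APIs in the account cannot invoke the function through this statement.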
Step 3: Configure the Route
Define a single route for inference requests:
| Setting | Value |
|---|---|
| Method | POST |
| Resource path | /generate |
Click Next.
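Once a request matches this route, API Gateway hands it to Lambda as an event rather than as the raw JSON body. With payload format version 2.0 (which, to my knowledge, is what the console uses for new HTTP API Lambda integrations), the body arrives as a string under `event["body"]`, with an `isBase64Encoded` flag. The Part 2 handler isn't reproduced here, so treat the following as a sketch of how a handler might pull out the prompt; `extract_prompt` is a hypothetical helper, not code from the earlier parts.

```python
import base64
import json


def extract_prompt(event: dict) -> str:
    """Return the prompt from an API Gateway (payload v2.0) event or a direct invocation."""
    if "body" not in event:
        # Direct invocation (for example a Lambda console test): the event is the payload itself.
        return event["prompt"]

    body = event["body"] or ""
    if event.get("isBase64Encoded"):
        # API Gateway may base64-encode the body; decode it back to text first.
        body = base64.b64decode(body).decode("utf-8")
    return json.loads(body)["prompt"]
```

Handling both shapes means the same function keeps working whether it is invoked directly or through the API.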
Step 4: Configure the Stage
Leave the stage name as $default and ensure Auto-deploy is enabled. Auto-deploy means any future changes to the API are published immediately without a manual deployment step.
Click Next → Create.
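Steps 1 through 4 can also be scripted. The sketch below uses the "quick create" form of boto3's `apigatewayv2` `create_api` call, which, per the AWS documentation, creates the API, the Lambda integration, the route, and an auto-deploying `$default` stage in one request. The API name and the Lambda ARN you pass in are placeholders of your choosing. One caveat: this path does not grant API Gateway permission to invoke the function, so that has to be handled separately.

```python
def quick_create_params(name: str, lambda_arn: str, route_key: str = "POST /generate") -> dict:
    """Build keyword arguments for apigatewayv2.create_api in quick-create mode."""
    return {
        "Name": name,
        "ProtocolType": "HTTP",   # HTTP API, as opposed to WEBSOCKET
        "Target": lambda_arn,     # quick create: integrate directly with this Lambda
        "RouteKey": route_key,    # "<METHOD> <path>"
    }


def create_http_api(name: str, lambda_arn: str, region: str = "us-east-2") -> str:
    """Create the API and return its base endpoint URL."""
    import boto3  # imported lazily so the helper above works without boto3 installed

    client = boto3.client("apigatewayv2", region_name=region)
    response = client.create_api(**quick_create_params(name, lambda_arn))
    return response["ApiEndpoint"]
```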
Step 5: Retrieve the Invoke URL
After creation you’ll land on the API dashboard. The Invoke URL is displayed at the top and follows this format:
https://your-api-id.execute-api.us-east-2.amazonaws.com
Append /generate to get your full endpoint:
https://your-api-id.execute-api.us-east-2.amazonaws.com/generate
Save this URL — it’s the address you’ll hit to run inference from any external client.
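If you build the URL in code rather than copying it, a small helper keeps the format in one place. This is only a sketch: `invoke_url` is a hypothetical name, and the stage handling assumes the usual HTTP API behaviour where the `$default` stage is served from the base URL while a named stage adds a path prefix.

```python
def invoke_url(api_id: str, region: str, path: str = "/generate", stage: str = "$default") -> str:
    """Assemble the full endpoint URL for a route on an HTTP API."""
    base = f"https://{api_id}.execute-api.{region}.amazonaws.com"
    # $default is served at the root; any other stage name becomes a path prefix.
    prefix = "" if stage == "$default" else f"/{stage}"
    return f"{base}{prefix}{path}"
```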
Step 6: Test the Endpoint
Before wiring this into anything else, verify the full pipeline end-to-end. Unlike REST APIs, HTTP APIs do not include a built-in test feature in the API Gateway console, so send a POST request to your /generate URL from an HTTP client such as curl or Postman, using the following JSON body:
```json
{
  "prompt": "Write a poem about the ocean."
}
```
A successful response confirms that API Gateway is correctly routing to Lambda, and Lambda is correctly invoking the SageMaker endpoint.
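If the test times out, double-check that your Lambda timeout is set to at least 60 seconds (covered in Part 2). Keep in mind that HTTP APIs cap the integration timeout at 30 seconds, so a generation that runs longer than that fails at the API layer (typically with a 503) even if Lambda is still working. If you receive a permissions-related error (a 403, or a 500 Internal Server Error, which is how HTTP APIs commonly report a failed Lambda invocation), verify that the Lambda resource-based policy includes an Allow for apigateway.amazonaws.com. This should have been set automatically in Step 2, but it’s worth confirming under Lambda → Configuration → Permissions.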
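The permission check can be done programmatically as well. The sketch below fetches the function's resource-based policy with boto3's Lambda `get_policy` call and looks for an Allow statement naming API Gateway. The helper names are hypothetical, and the parsing assumes the standard IAM policy JSON layout.

```python
import json


def allows_api_gateway(policy_json: str) -> bool:
    """True if some Allow statement lets apigateway.amazonaws.com invoke the function."""
    for stmt in json.loads(policy_json).get("Statement", []):
        principal = stmt.get("Principal", {})
        service = principal.get("Service") if isinstance(principal, dict) else None
        # Both Service and Action may be a single string or a list of strings.
        services = [service] if isinstance(service, str) else (service or [])
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        if (
            stmt.get("Effect") == "Allow"
            and "apigateway.amazonaws.com" in services
            and "lambda:InvokeFunction" in actions
        ):
            return True
    return False


def check_function(function_name: str = "generate-text-lamini", region: str = "us-east-2") -> bool:
    """Fetch the live policy and check it; raises if the function has no policy at all."""
    import boto3  # imported lazily so the parser above works without boto3 installed

    client = boto3.client("lambda", region_name=region)
    policy = client.get_policy(FunctionName=function_name)["Policy"]
    return allows_api_gateway(policy)
```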
Step 7: Call the API from Python
With the endpoint live, you can query your model from any HTTP client. Here’s a minimal Python example:
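```python
import requests

# Replace with your own Invoke URL from Step 5.
url = "https://your-api-id.execute-api.us-east-2.amazonaws.com/generate"

payload = {"prompt": "Explain what a transformer model is."}
response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
print(response.json())
```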
Replace the URL with your actual Invoke URL from Step 5. The json= parameter in requests.post serializes the dictionary to JSON and sets the Content-Type header to application/json automatically.
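To make explicit what `json=` does behind the scenes, here is a rough standard-library equivalent that serializes the body and sets the header by hand. It is a sketch rather than a drop-in replacement: `build_request` and `generate` are hypothetical names, and the shape of the parsed response depends on what your Part 2 Lambda returns.

```python
import json
import urllib.request


def build_request(url: str, prompt: str) -> urllib.request.Request:
    """Build the POST request that requests.post(url, json=...) would send."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(url: str, prompt: str, timeout: float = 60.0):
    """Send the prompt and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(url, prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

This version is handy in environments where installing `requests` is not an option, such as a minimal container.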
Monitoring
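Lambda writes its logs to CloudWatch Logs automatically, provided its execution role allows it. API Gateway access logs for an HTTP API are not enabled by default; turn them on in the stage’s logging settings if you need request-level records. If a request succeeds at the API layer but returns an unexpected result, the Lambda log group (/aws/lambda/generate-text-lamini) is the first place to look.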
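You can also pull recent Lambda log lines without opening the console. The sketch below uses boto3's CloudWatch Logs `filter_log_events` call. The helper names, the ten-minute window, and the 50-event limit are arbitrary choices for illustration.

```python
import time


def log_group_for(function_name: str) -> str:
    """Lambda's default log group name for a function."""
    return f"/aws/lambda/{function_name}"


def recent_log_messages(
    function_name: str = "generate-text-lamini",
    minutes: int = 10,
    region: str = "us-east-2",
) -> list:
    """Return up to 50 log messages from the last `minutes` minutes."""
    import boto3  # imported lazily so log_group_for works without boto3 installed

    client = boto3.client("logs", region_name=region)
    start_ms = int((time.time() - minutes * 60) * 1000)  # CloudWatch expects epoch milliseconds
    response = client.filter_log_events(
        logGroupName=log_group_for(function_name),
        startTime=start_ms,
        limit=50,
    )
    return [event["message"] for event in response["events"]]
```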
Summary
The three-part pipeline is now complete:
Client → API Gateway (POST /generate) → Lambda → SageMaker Endpoint → Model
You have a public HTTPS endpoint that accepts a JSON prompt, routes it through a serverless Lambda function, invokes a GPU-backed SageMaker endpoint, and returns the model’s response. API Gateway and Lambda incur no compute cost while the API is idle, but the real-time SageMaker endpoint keeps its instance running (and billing) until you delete it.
From Part 1: IAM execution role + SageMaker GPU endpoint serving LaMini-T5-738M
From Part 2: Lambda function with least-privilege IAM role and a 60-second timeout
From Part 3: HTTP API Gateway with auto-deploy routing POST /generate to Lambda