Health Check Endpoint
Sunschool includes a built-in health check endpoint for monitoring and load balancer integration.
From server/auth.ts:
app.get("/api/healthcheck", asyncHandler(async (req, res) => {
  try {
    // Count users for basic DB query test
    const result = await db.select({ count: count() }).from(users);
    const userCount = result[0]?.count || 0;
    return res.json({
      status: "ok",
      db: "connected",
      userCount
    });
  } catch (error) {
    console.error("Health check error:", error);
    return res.status(500).json({
      status: "error",
      message: "Database connection failed",
      error: (error as Error).message
    });
  }
}));
Test Health Endpoint
# Check server health
curl http://localhost:5000/api/healthcheck
# Success response:
{
  "status": "ok",
  "db": "connected",
  "userCount": 42
}
# Error response (database down):
{
  "status": "error",
  "message": "Database connection failed",
  "error": "Connection refused"
}
Railway Health Checks
From ENGINEERING.md:
Health check at /api/healthcheck (60s timeout, restart on failure, max 3 retries)
Railway configuration in railway.json:
{
  "$schema": "https://railway.app/railway.schema.json",
  "build": {
    "builder": "NIXPACKS"
  },
  "deploy": {
    "healthcheckPath": "/api/healthcheck",
    "healthcheckTimeout": 60,
    "restartPolicyType": "ON_FAILURE",
    "restartPolicyMaxRetries": 3
  }
}
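Health check behavior:
Railway requests /api/healthcheck when a new deployment starts and waits up to 60 seconds for an HTTP 200 response
Traffic is routed to the new deployment only after the check passes, which keeps deploys zero-downtime
The endpoint is not polled after the deployment goes live, so use an external monitor for continuous checks
The restart policy (ON_FAILURE, max 3 retries) restarts the container when the process exits with an error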
Custom Health Checks
Extend the health check with additional diagnostics:
app.get("/api/healthcheck", asyncHandler(async (req, res) => {
  const checks = {
    status: 'ok',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    memory: process.memoryUsage(),
    database: 'unknown',
    ai_provider: 'unknown'
  };
  // Database check
  try {
    await db.select({ count: count() }).from(users);
    checks.database = 'connected';
  } catch (error) {
    checks.database = 'error';
    checks.status = 'degraded';
  }
  // AI provider check (optional)
  if (OPENROUTER_API_KEY) {
    checks.ai_provider = 'configured';
  }
  const statusCode = checks.status === 'ok' ? 200 : 503;
  res.status(statusCode).json(checks);
}));
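A database call that hangs would keep the health check open until the platform's own timeout fires. One way to bound each probe is a small timeout wrapper. This is a minimal sketch: `withTimeout` and `runCheck` are hypothetical helper names, not part of Sunschool, and the timeout value is an assumption to tune per deployment.

```typescript
// Hypothetical helper: reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

type CheckResult = "connected" | "error";

// Run one probe and map its outcome to the status strings used by the handler above.
async function runCheck(probe: () => Promise<unknown>, ms: number): Promise<CheckResult> {
  try {
    await withTimeout(probe(), ms, "health probe");
    return "connected";
  } catch {
    return "error";
  }
}
```

In the handler, the database check could then become `checks.database = await runCheck(() => db.select({ count: count() }).from(users), 2000);`, with `checks.status` set to 'degraded' when the result is 'error'.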
Logging
Server Logs
Sunschool uses console.log for application logging:
// Success logs
console.log('[BG] Images generated for lesson', lessonId);
console.log('✅ Migrations applied successfully');
// Error logs
console.error('Bittensor chat failed:', error);
console.error('⚠️ Migration failed:', error);
// Warning logs
console.warn('Bittensor requested but not enabled. Falling back to OpenRouter.');
Log Levels
npm run dev
# Verbose output:
# [Server] Starting on port 5000
# [DB] Connected to postgresql://localhost:5432/sunschool
# [Auth] User login: admin (role: ADMIN)
# [Lesson] Generating lesson for grade 5: Photosynthesis
# [BG] Images generated for lesson abc-123
npm start
# Minimal output:
# Server listening on port 5000
# ✅ Migrations applied
# Error: Database connection lost (automatic retry)
Structured Logging
Not implemented by default. Add structured logging for production monitoring.
Recommended: Winston
import winston from 'winston';
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
  ],
});
if (process.env.NODE_ENV !== 'production') {
  logger.add(new winston.transports.Console({
    format: winston.format.simple(),
  }));
}
// Usage
logger.info('User login', { userId: user.id, role: user.role });
logger.error('Database error', { error: error.message, stack: error.stack });
Railway Logs
Access logs via Railway CLI:
# Stream logs
railway logs
# Filter by service
railway logs --service sunschool
# Show last 100 lines
railway logs --tail 100
# Follow live
railway logs --follow
Log retention:
Railway Free: 7 days
Railway Pro: 30 days
Error Tracking
Server Errors
From server/middleware/auth.ts:
export function asyncHandler(fn: (req: Request, res: Response, next: NextFunction) => Promise<any>) {
  return function (req: Request, res: Response, next: NextFunction): Promise<void> {
    return Promise
      .resolve(fn(req, res, next))
      .catch((error) => {
        console.error('Unhandled API error:', error);
        if (res.headersSent) {
          return next(error);
        }
        // For auth endpoints, provide specific error handling
        if (req.path.includes('/api/login') || req.path.includes('/api/register')) {
          res.status(500).json({
            error: 'Authentication service error',
            message: error.message || 'An unexpected error occurred'
          });
          return;
        }
        next(error);
      });
  };
}
Production (NODE_ENV=production):
{
  "error": "Failed to generate lesson content"
}
Development:
{
  "error": "Failed to generate lesson content",
  "details": "OpenRouter API rate limit exceeded",
  "stack": "Error: Rate limit\n    at generateLesson (server/services/ai.ts:45:10)"
}
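The two response shapes above could come from a single formatting step that checks NODE_ENV. The sketch below assumes a hypothetical `formatErrorResponse` helper; the actual Sunschool error handler is not shown in this section, so treat the names and field choices as illustrative.

```typescript
interface ErrorBody {
  error: string;
  details?: string;
  stack?: string;
}

// Hypothetical helper: build the JSON body for a failed request.
// `publicMessage` is the safe, user-facing text; internal details are
// included only when the environment is not production.
function formatErrorResponse(
  publicMessage: string,
  cause: Error,
  nodeEnv: string | undefined,
): ErrorBody {
  if (nodeEnv === "production") {
    return { error: publicMessage };
  }
  const body: ErrorBody = { error: publicMessage, details: cause.message };
  if (cause.stack !== undefined) {
    body.stack = cause.stack;
  }
  return body;
}
```

An Express error-handling middleware could call it as `res.status(500).json(formatErrorResponse('Failed to generate lesson content', err, process.env.NODE_ENV))`.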
Sentry Integration
Recommended for production. Track errors with context and stack traces.
import * as Sentry from "@sentry/node";
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1,
});
// Add to error handler (register after all routes).
// Sentry.Handlers is the @sentry/node v7 API; v8 and later use Sentry.setupExpressErrorHandler(app).
app.use(Sentry.Handlers.errorHandler());
// Manual error reporting
try {
  await generateLesson(topic, gradeLevel);
} catch (error) {
  Sentry.captureException(error, {
    tags: { topic, gradeLevel },
    user: { id: req.user.id }
  });
  throw error;
}
Database Connection Pooling
From server/db.ts:
export const pool = new Pool({
  connectionString: DATABASE_URL,
  max: 10, // Maximum connections
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000,
});
// Keep-alive for Neon serverless
setInterval(() => {
  pool.query('SELECT 1');
}, 120000); // 2 minutes
Monitor pool stats:
app.get('/api/admin/pool-status', hasRole(['ADMIN']), (req, res) => {
  res.json({
    totalCount: pool.totalCount,
    idleCount: pool.idleCount,
    waitingCount: pool.waitingCount,
  });
});
Response Times
Track endpoint performance:
import responseTime from 'response-time';
app.use(responseTime((req, res, time) => {
  if (time > 1000) { // Log slow requests
    console.warn(`Slow request: ${req.method} ${req.url} took ${time}ms`);
  }
  // Optional: Send to metrics service
  // metrics.timing('http.response_time', time, { path: req.path });
}));
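Logging only requests slower than one second hides gradual slowdowns. Keeping a rolling window of recent timings makes percentiles available to a metrics endpoint. The sketch below is dependency-free; `LatencyWindow` is a hypothetical class, the window size is an assumption, and the percentile uses the nearest-rank method.

```typescript
// Hypothetical rolling window of the most recent response times, in milliseconds.
class LatencyWindow {
  private samples: number[] = [];
  private readonly capacity: number;

  constructor(capacity = 1000) {
    this.capacity = capacity;
  }

  record(ms: number): void {
    this.samples.push(ms);
    if (this.samples.length > this.capacity) {
      this.samples.shift(); // drop the oldest sample
    }
  }

  // Nearest-rank percentile for p in (0, 100]; undefined when no samples exist.
  percentile(p: number): number | undefined {
    if (this.samples.length === 0) return undefined;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(rank, 1) - 1];
  }
}
```

Inside the `responseTime` callback above, calling `latencies.record(time)` on a shared instance would feed the window, and an admin endpoint could report `latencies.percentile(95)`.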
Memory Usage
app.get('/api/admin/metrics', hasRole(['ADMIN']), (req, res) => {
  const usage = process.memoryUsage();
  res.json({
    rss: `${Math.round(usage.rss / 1024 / 1024)} MB`,
    heapTotal: `${Math.round(usage.heapTotal / 1024 / 1024)} MB`,
    heapUsed: `${Math.round(usage.heapUsed / 1024 / 1024)} MB`,
    external: `${Math.round(usage.external / 1024 / 1024)} MB`,
    uptime: `${Math.round(process.uptime() / 60)} minutes`,
  });
});
Application Metrics
User Activity
From server/services/activity-service.ts (if implemented):
// Track active users
const activeUsers = new Set();
app.use(isAuthenticated, (req, res, next) => {
  if (req.user) {
    activeUsers.add(req.user.id);
    setTimeout(() => activeUsers.delete(req.user.id), 300000); // 5 min
  }
  next();
});
// Metrics endpoint
app.get('/api/admin/metrics/users', hasRole(['ADMIN']), (req, res) => {
  res.json({
    activeUsers: activeUsers.size,
    timestamp: new Date().toISOString(),
  });
});
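In the snippet above, every request schedules its own deletion timer, so a timer left over from an earlier request can remove a user who was active only seconds ago. One alternative is to store a last-seen timestamp per user and count the entries that fall inside the window. This is a sketch under assumptions: `ActiveUserTracker` is a hypothetical class, and the user ID type is a guess since the schema is not shown here.

```typescript
type UserId = string | number; // assumption: the real ID type is not shown in this section

// Hypothetical tracker: a user counts as active if seen within `windowMs`.
class ActiveUserTracker {
  private readonly lastSeen = new Map<UserId, number>();
  private readonly windowMs: number;

  constructor(windowMs = 5 * 60 * 1000) {
    this.windowMs = windowMs;
  }

  // Call from middleware on each authenticated request.
  touch(userId: UserId, now: number = Date.now()): void {
    this.lastSeen.set(userId, now);
  }

  // Drop expired entries, then return the number of active users.
  count(now: number = Date.now()): number {
    this.lastSeen.forEach((seenAt, userId) => {
      if (now - seenAt > this.windowMs) {
        this.lastSeen.delete(userId);
      }
    });
    return this.lastSeen.size;
  }
}
```

The middleware would call `tracker.touch(req.user.id)` and the metrics endpoint would report `tracker.count()`, with no timers to manage.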
Lesson Generation Metrics
let lessonStats = {
  total: 0,
  successful: 0,
  failed: 0,
  avgDuration: 0, // mean duration of successful generations, in ms
};
// Track in lesson generation
const startTime = Date.now();
try {
  const lesson = await generateLessonWithRetry(...);
  lessonStats.total++;
  lessonStats.successful++;
  // Running mean over successful runs only; failures record no duration,
  // so dividing by total would understate the average
  lessonStats.avgDuration =
    (lessonStats.avgDuration * (lessonStats.successful - 1) + (Date.now() - startTime)) / lessonStats.successful;
} catch (error) {
  lessonStats.total++;
  lessonStats.failed++;
  throw error;
}
Alerting
Railway Alerts
Configure alerts in Railway dashboard:
Health check failures - Email/Slack on 3+ consecutive failures
High memory usage - Alert when > 90% of allocated memory
Deployment failures - Notify on failed builds
Database connection issues - Alert on connection pool exhaustion
Custom Alerts
Email on critical errors:
import nodemailer from 'nodemailer';
const transporter = nodemailer.createTransport({
  host: process.env.SMTP_HOST,
  port: 587,
  auth: {
    user: process.env.SMTP_USER,
    pass: process.env.SMTP_PASS,
  },
});
async function sendAlert(subject: string, message: string) {
  if (process.env.NODE_ENV !== 'production') return;
  await transporter.sendMail({
    from: 'alerts@sunschool.xyz',
    to: process.env.ADMIN_EMAIL,
    subject: `[Sunschool Alert] ${subject}`,
    text: message,
  });
}
// Usage
try {
  await migrate(db, { migrationsFolder: './drizzle/migrations' });
} catch (error) {
  await sendAlert('Migration Failed', error.message);
  throw error;
}
Slack Integration
npm install @slack/webhook
import { IncomingWebhook } from '@slack/webhook';
const webhook = new IncomingWebhook(process.env.SLACK_WEBHOOK_URL);
async function notifySlack(message: string, level: 'info' | 'warning' | 'error') {
  const colors = { info: '#36a64f', warning: '#ff9900', error: '#ff0000' };
  await webhook.send({
    attachments: [{
      color: colors[level],
      title: 'Sunschool Alert',
      text: message,
      ts: Math.floor(Date.now() / 1000).toString(),
    }]
  });
}
// Usage
await notifySlack('Database migration completed successfully', 'info');
await notifySlack('OpenRouter API rate limit reached', 'warning');
await notifySlack('Critical: Database connection lost', 'error');
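Both sendAlert and notifySlack send a message on every call, so an error that repeats on each request could flood the inbox or channel. A per-key cooldown is one common mitigation. The sketch below assumes a hypothetical `AlertThrottle` class and an arbitrary 15-minute default; neither comes from the Sunschool codebase.

```typescript
// Hypothetical cooldown gate: allow each alert key at most once per `cooldownMs`.
class AlertThrottle {
  private readonly lastSent = new Map<string, number>();
  private readonly cooldownMs: number;

  constructor(cooldownMs = 15 * 60 * 1000) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true if an alert for `key` may be sent now, and records the send time.
  shouldSend(key: string, now: number = Date.now()): boolean {
    const previous = this.lastSent.get(key);
    if (previous !== undefined && now - previous < this.cooldownMs) {
      return false; // still inside the cooldown window
    }
    this.lastSent.set(key, now);
    return true;
  }
}
```

A caller would guard the send, for example `if (throttle.shouldSend('db-connection-lost')) await notifySlack('Critical: Database connection lost', 'error');`.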
Monitoring Dashboard
Grafana + Prometheus
Full-featured monitoring stack:
# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - 9090:9090
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports:
      - 3000:3000
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
Metrics:
Request rates and latencies
Database connection pool
Memory and CPU usage
Custom business metrics
Datadog
Managed APM solution:
// Must be first import
import 'dd-trace/init';
// Automatic instrumentation for:
// - HTTP requests
// - Database queries
// - Error tracking
// - Performance profiling
New Relic
Application performance monitoring:
// Add to top of server/index.ts
require('newrelic');
// Auto-instruments:
// - Express routes
// - PostgreSQL queries
// - External API calls
Custom Metrics Endpoint
app.get('/api/admin/metrics', hasRole(['ADMIN']), async (req, res) => {
  // db.select() resolves to an array of rows; read the count from the first row
  const [userTotals] = await db.select({ count: count() }).from(users);
  const metrics = {
    server: {
      uptime: process.uptime(),
      memory: process.memoryUsage(),
      cpu: process.cpuUsage(),
    },
    database: {
      poolSize: pool.totalCount,
      idleConnections: pool.idleCount,
      waitingClients: pool.waitingCount,
    },
    users: {
      total: userTotals?.count ?? 0,
      active: activeUsers.size,
    },
    lessons: {
      generated: lessonStats.total,
      successful: lessonStats.successful,
      failed: lessonStats.failed,
      avgDuration: lessonStats.avgDuration,
    },
  };
  res.json(metrics);
});
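The JSON endpoint above suits dashboards that poll it directly, but a Prometheus server scrapes its own text exposition format. Libraries such as prom-client normally produce that format; the dependency-free sketch below only illustrates its shape. The function name and the metric names are hypothetical, not an existing Sunschool convention.

```typescript
interface LessonStats {
  total: number;
  successful: number;
  failed: number;
  avgDuration: number;
}

// Hypothetical renderer: turn the in-memory counters into Prometheus text format.
// Each metric gets a HELP line, a TYPE line, and one sample line per label set.
function renderLessonMetrics(stats: LessonStats): string {
  const lines = [
    "# HELP sunschool_lessons_total Lesson generation attempts by outcome.",
    "# TYPE sunschool_lessons_total counter",
    `sunschool_lessons_total{outcome="success"} ${stats.successful}`,
    `sunschool_lessons_total{outcome="failure"} ${stats.failed}`,
    "# HELP sunschool_lesson_avg_duration_ms Mean lesson generation duration in milliseconds.",
    "# TYPE sunschool_lesson_avg_duration_ms gauge",
    `sunschool_lesson_avg_duration_ms ${stats.avgDuration}`,
  ];
  return lines.join("\n") + "\n";
}
```

A route such as `app.get('/metrics', ...)` could return this string with the `text/plain` content type so the Prometheus container from the compose file can scrape it.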
Troubleshooting with Logs
Common Log Patterns
Migration failures
Log pattern: ⚠️ Migration failed (server continues): Error: column "new_field" already exists
Diagnosis: Migration partially applied or run twice.
Fix: Check the drizzle_migrations table and manually reconcile the schema.
Database connection errors
Log pattern: Health check error: Connection refused
Error: ECONNREFUSED postgresql://localhost:5432
Diagnosis: PostgreSQL is not running or DATABASE_URL is wrong.
Fix: Verify the database status and the connection string.
AI provider fallback
Log pattern: Bittensor chat failed: Error: timeout after 10000ms
Falling back to OpenRouter
Diagnosis: Bittensor is unavailable and the automatic fallback is working.
Action: Monitor how often the fallback fires and consider disabling Bittensor if it is unreliable.
Out-of-memory crashes
Log pattern: <--- Last few GCs --->
[12345:0x...] 180000 ms: Mark-sweep 1500.0 (1600.0) -> 1450.0 (1600.0) MB
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit
Diagnosis: Memory leak or insufficient memory allocation.
Fix: Increase the memory limit (NODE_OPTIONS=--max-old-space-size=2048) or fix the leak.
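The patterns above can also be matched automatically when scanning exported logs. The sketch below encodes them as a small lookup table; `KNOWN_PATTERNS` and `diagnose` are hypothetical names, and the regular expressions are assumptions based only on the sample log lines in this section.

```typescript
interface KnownPattern {
  pattern: RegExp;
  diagnosis: string;
}

// Hypothetical lookup table built from the log patterns listed above.
const KNOWN_PATTERNS: KnownPattern[] = [
  { pattern: /Migration failed/, diagnosis: "Migration partially applied or run twice" },
  { pattern: /ECONNREFUSED|Connection refused/, diagnosis: "PostgreSQL not running or wrong DATABASE_URL" },
  { pattern: /Bittensor chat failed/, diagnosis: "Bittensor unavailable, fallback to OpenRouter" },
  { pattern: /Reached heap limit/, diagnosis: "Memory leak or insufficient memory allocation" },
];

// Return the diagnosis for the first matching pattern, or undefined for unknown lines.
function diagnose(line: string): string | undefined {
  const match = KNOWN_PATTERNS.find((entry) => entry.pattern.test(line));
  return match ? match.diagnosis : undefined;
}
```

Piping log lines through `diagnose` and counting the results gives a rough picture of which failure mode dominates.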
Best Practices
Enable Health Checks
Configure load balancer or orchestrator to poll /api/healthcheck
Structured Logging
Use Winston or Pino for JSON logs with context
Error Tracking
Integrate Sentry for production error monitoring
Performance Metrics
Track response times, database queries, AI generation duration
Alerting
Set up alerts for critical errors, health check failures, high resource usage
Log Retention
Archive logs for compliance (30+ days recommended)
Next Steps