Overview

CrewAI integrates with multiple LLM providers through each provider’s native SDK, giving you the flexibility to choose the right model for your specific use case. This guide will help you understand how to configure and use different LLM providers in your CrewAI projects.

What are LLMs?

Large Language Models (LLMs) are the core intelligence behind CrewAI agents. They enable agents to understand context, make decisions, and generate human-like responses. Here’s what you need to know:

LLM Basics

Large Language Models are AI systems trained on vast amounts of text data. They power the intelligence of your CrewAI agents, enabling them to understand and generate human-like text.

Context Window

The context window determines how much text an LLM can process at once. Larger windows (e.g., 128K tokens) allow for more context but may be more expensive and slower.

Temperature

Temperature (0.0 to 1.0) controls response randomness. Lower values (e.g., 0.2) produce more focused, deterministic outputs, while higher values (e.g., 0.8) increase creativity and variability.
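
For example, the same LLM can be configured for either behavior. A minimal sketch (the model name is only an illustration):

from crewai import LLM

# Low temperature: focused, repeatable answers (extraction, Q&A)
factual_llm = LLM(model="openai/gpt-4o", temperature=0.2)

# High temperature: more varied, creative output (brainstorming, copywriting)
creative_llm = LLM(model="openai/gpt-4o", temperature=0.8)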

Provider Selection

Each LLM provider (e.g., OpenAI, Anthropic, Google) offers different models with varying capabilities, pricing, and features. Choose based on your needs for accuracy, speed, and cost.

Setting up your LLM

You can specify the model to use in several places in your CrewAI code. Once you specify a model, you will need to provide the configuration (like an API key) for its provider. See the provider configuration examples section below for your provider.
  1. Environment Variables
  2. YAML Configuration
  3. Direct Code
The simplest way to get started. Set the model in your environment directly, through a .env file, or in your app code. If you used crewai create to bootstrap your project, it will already be set.
.env
MODEL=model-id  # e.g. gpt-4o, gemini-2.0-flash, claude-3-sonnet-...

# Be sure to set your API keys here too. See the Provider
# section below.
Never commit API keys to version control. Use environment files (.env) or your system’s secret management.
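
If you prefer YAML, you can also pin the model per agent. A minimal sketch, assuming the standard config/agents.yaml layout created by crewai create (the llm key takes a provider/model string; the other fields are illustrative):
config/agents.yaml
researcher:
  role: Research Specialist
  goal: Find accurate, up-to-date information
  backstory: A meticulous analyst who verifies every source.
  llm: openai/gpt-4o  # provider/model string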

Provider Configuration Examples

CrewAI supports a multitude of LLM providers, each offering unique features, authentication methods, and model capabilities. In this section, you’ll find detailed examples that help you select, configure, and optimize the LLM that best fits your project’s needs.

OpenAI

CrewAI provides native integration with OpenAI through the OpenAI Python SDK.
Code
# Required
OPENAI_API_KEY=sk-...

# Optional
OPENAI_BASE_URL=<custom-base-url>
Basic Usage:
Code
from crewai import LLM

llm = LLM(
    model="openai/gpt-4o",
    api_key="your-api-key",  # Or set OPENAI_API_KEY
    temperature=0.7,
    max_tokens=4000
)
Advanced Configuration:
Code
from crewai import LLM

llm = LLM(
    model="openai/gpt-4o",
    api_key="your-api-key",
    base_url="https://api.openai.com/v1",  # Optional custom endpoint
    organization="org-...",  # Optional organization ID
    project="proj_...",  # Optional project ID
    temperature=0.7,
    max_tokens=4000,
    max_completion_tokens=4000,  # For newer models
    top_p=0.9,
    frequency_penalty=0.1,
    presence_penalty=0.1,
    stop=["END"],
    seed=42,  # For reproducible outputs
    stream=True,  # Enable streaming
    timeout=60.0,  # Request timeout in seconds
    max_retries=3,  # Maximum retry attempts
    logprobs=True,  # Return log probabilities
    top_logprobs=5,  # Number of most likely tokens
    reasoning_effort="medium"  # For o1 models: low, medium, high
)
Structured Outputs:
Code
from pydantic import BaseModel
from crewai import LLM

class ResponseFormat(BaseModel):
    name: str
    age: int
    summary: str

llm = LLM(
    model="openai/gpt-4o",
    response_format=ResponseFormat  # Parse output into the Pydantic model
)
Supported Environment Variables:
  • OPENAI_API_KEY: Your OpenAI API key (required)
  • OPENAI_BASE_URL: Custom base URL for OpenAI API (optional)
Features:
  • Native function calling support (except o1 models)
  • Structured outputs with JSON schema
  • Streaming support for real-time responses
  • Token usage tracking
  • Stop sequences support (except o1 models)
  • Log probabilities for token-level insights
  • Reasoning effort control for o1 models
Supported Models:
| Model | Context Window | Best For |
|---|---|---|
| gpt-4.1 | 1M tokens | Latest model with enhanced capabilities |
| gpt-4.1-mini | 1M tokens | Efficient version with large context |
| gpt-4.1-nano | 1M tokens | Ultra-efficient variant |
| gpt-4o | 128,000 tokens | Optimized for speed and intelligence |
| gpt-4o-mini | 200,000 tokens | Cost-effective with large context |
| gpt-4-turbo | 128,000 tokens | Long-form content, document analysis |
| gpt-4 | 8,192 tokens | High-accuracy tasks, complex reasoning |
| o1 | 200,000 tokens | Advanced reasoning, complex problem-solving |
| o1-preview | 128,000 tokens | Preview of reasoning capabilities |
| o1-mini | 128,000 tokens | Efficient reasoning model |
| o3-mini | 200,000 tokens | Lightweight reasoning model |
| o4-mini | 200,000 tokens | Next-gen efficient reasoning |
Note: To use OpenAI, install the required dependencies:
uv add "crewai[openai]"

Meta Llama

Meta’s Llama API provides access to Meta’s family of large language models. Set the following environment variables in your .env file:
Code
# Meta Llama API Key Configuration
LLAMA_API_KEY=LLM|your_api_key_here
Example usage in your CrewAI project:
Code
from crewai import LLM

# Initialize Meta Llama LLM
llm = LLM(
    model="meta_llama/Llama-4-Scout-17B-16E-Instruct-FP8",
    temperature=0.8,
    stop=["END"],
    seed=42
)
All models listed at https://llama.developer.meta.com/docs/models/ are supported.
| Model ID | Input context length | Output context length | Input Modalities | Output Modalities |
|---|---|---|---|---|
| meta_llama/Llama-4-Scout-17B-16E-Instruct-FP8 | 128k | 4028 | Text, Image | Text |
| meta_llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 128k | 4028 | Text, Image | Text |
| meta_llama/Llama-3.3-70B-Instruct | 128k | 4028 | Text | Text |
| meta_llama/Llama-3.3-8B-Instruct | 128k | 4028 | Text | Text |

Anthropic

CrewAI provides native integration with Anthropic through the Anthropic Python SDK.
Code
# Required
ANTHROPIC_API_KEY=sk-ant-...
Basic Usage:
Code
from crewai import LLM

llm = LLM(
    model="anthropic/claude-3-5-sonnet-20241022",
    api_key="your-api-key",  # Or set ANTHROPIC_API_KEY
    max_tokens=4096  # Required for Anthropic
)
Advanced Configuration:
Code
from crewai import LLM

llm = LLM(
    model="anthropic/claude-3-5-sonnet-20241022",
    api_key="your-api-key",
    base_url="https://api.anthropic.com",  # Optional custom endpoint
    temperature=0.7,
    max_tokens=4096,  # Required parameter
    top_p=0.9,
    stop_sequences=["END", "STOP"],  # Anthropic uses stop_sequences
    stream=True,  # Enable streaming
    timeout=60.0,  # Request timeout in seconds
    max_retries=3  # Maximum retry attempts
)
Supported Environment Variables:
  • ANTHROPIC_API_KEY: Your Anthropic API key (required)
Features:
  • Native tool use support for Claude 3+ models
  • Streaming support for real-time responses
  • Automatic system message handling
  • Stop sequences for controlled output
  • Token usage tracking
  • Multi-turn tool use conversations
Important Notes:
  • max_tokens is a required parameter for all Anthropic models
  • Claude uses stop_sequences instead of stop
  • System messages are handled separately from conversation messages
  • First message must be from the user (automatically handled)
  • Messages must alternate between user and assistant
Supported Models:
| Model | Context Window | Best For |
|---|---|---|
| claude-3-7-sonnet | 200,000 tokens | Advanced reasoning and agentic tasks |
| claude-3-5-sonnet-20241022 | 200,000 tokens | Latest Sonnet with best performance |
| claude-3-5-haiku | 200,000 tokens | Fast, compact model for quick responses |
| claude-3-opus | 200,000 tokens | Most capable for complex tasks |
| claude-3-sonnet | 200,000 tokens | Balanced intelligence and speed |
| claude-3-haiku | 200,000 tokens | Fastest for simple tasks |
| claude-2.1 | 200,000 tokens | Extended context, reduced hallucinations |
| claude-2 | 100,000 tokens | Versatile model for various tasks |
| claude-instant | 100,000 tokens | Fast, cost-effective for everyday tasks |
Note: To use Anthropic, install the required dependencies:
uv add "crewai[anthropic]"

Google Gemini

CrewAI provides native integration with Google Gemini through the Google Gen AI Python SDK. Set your API key in your .env file. If you need a key, check AI Studio.
.env
# Required (one of the following)
GOOGLE_API_KEY=<your-api-key>
GEMINI_API_KEY=<your-api-key>

# Optional - for Vertex AI
GOOGLE_CLOUD_PROJECT=<your-project-id>
GOOGLE_CLOUD_LOCATION=<location>  # Defaults to us-central1
GOOGLE_GENAI_USE_VERTEXAI=true  # Set to use Vertex AI
Basic Usage:
Code
from crewai import LLM

llm = LLM(
    model="gemini/gemini-2.0-flash",
    api_key="your-api-key",  # Or set GOOGLE_API_KEY/GEMINI_API_KEY
    temperature=0.7
)
Advanced Configuration:
Code
from crewai import LLM

llm = LLM(
    model="gemini/gemini-2.5-flash",
    api_key="your-api-key",
    temperature=0.7,
    top_p=0.9,
    top_k=40,  # Top-k sampling parameter
    max_output_tokens=8192,
    stop_sequences=["END", "STOP"],
    stream=True,  # Enable streaming
    safety_settings={
        "HARM_CATEGORY_HARASSMENT": "BLOCK_NONE",
        "HARM_CATEGORY_HATE_SPEECH": "BLOCK_NONE"
    }
)
Vertex AI Configuration:
Code
from crewai import LLM

llm = LLM(
    model="gemini/gemini-1.5-pro",
    project="your-gcp-project-id",
    location="us-central1"  # GCP region
)
Supported Environment Variables:
  • GOOGLE_API_KEY or GEMINI_API_KEY: Your Google API key (required for Gemini API)
  • GOOGLE_CLOUD_PROJECT: Google Cloud project ID (for Vertex AI)
  • GOOGLE_CLOUD_LOCATION: GCP location (defaults to us-central1)
  • GOOGLE_GENAI_USE_VERTEXAI: Set to true to use Vertex AI
Features:
  • Native function calling support for Gemini 1.5+ and 2.x models
  • Streaming support for real-time responses
  • Multimodal capabilities (text, images, video)
  • Safety settings configuration
  • Support for both Gemini API and Vertex AI
  • Automatic system instruction handling
  • Token usage tracking
Gemini Models:
Google offers a range of powerful models optimized for different use cases.
| Model | Context Window | Best For |
|---|---|---|
| gemini-2.5-flash | 1M tokens | Adaptive thinking, cost efficiency |
| gemini-2.5-pro | 1M tokens | Enhanced thinking and reasoning, multimodal understanding |
| gemini-2.0-flash | 1M tokens | Next generation features, speed, thinking |
| gemini-2.0-flash-thinking | 32,768 tokens | Advanced reasoning with thinking process |
| gemini-2.0-flash-lite | 1M tokens | Cost efficiency and low latency |
| gemini-1.5-pro | 2M tokens | Best performing, logical reasoning, coding |
| gemini-1.5-flash | 1M tokens | Balanced multimodal model, good for most tasks |
| gemini-1.5-flash-8b | 1M tokens | Fastest, most cost-efficient |
| gemini-1.0-pro | 32,768 tokens | Earlier generation model |
Gemma Models:
The Gemini API also supports Gemma models hosted on Google infrastructure.
| Model | Context Window | Best For |
|---|---|---|
| gemma-3-1b | 32,000 tokens | Ultra-lightweight tasks |
| gemma-3-4b | 128,000 tokens | Efficient general-purpose tasks |
| gemma-3-12b | 128,000 tokens | Balanced performance and efficiency |
| gemma-3-27b | 128,000 tokens | High-performance tasks |
Note: To use Google Gemini, install the required dependencies:
uv add "crewai[google-genai]"
The full list of models is available in the Gemini model docs.

Google Vertex AI

Get credentials from your Google Cloud Console, save them to a JSON file, then load them with the following code:
Code
import json

file_path = 'path/to/vertex_ai_service_account.json'

# Load the JSON file
with open(file_path, 'r') as file:
    vertex_credentials = json.load(file)

# Convert the credentials to a JSON string
vertex_credentials_json = json.dumps(vertex_credentials)
Example usage in your CrewAI project:
Code
from crewai import LLM

llm = LLM(
    model="gemini-1.5-pro-latest", # or vertex_ai/gemini-1.5-pro-latest
    temperature=0.7,
    vertex_credentials=vertex_credentials_json
)
Google offers a range of powerful models optimized for different use cases:
| Model | Context Window | Best For |
|---|---|---|
| gemini-2.5-flash-preview-04-17 | 1M tokens | Adaptive thinking, cost efficiency |
| gemini-2.5-pro-preview-05-06 | 1M tokens | Enhanced thinking and reasoning, multimodal understanding, advanced coding, and more |
| gemini-2.0-flash | 1M tokens | Next generation features, speed, thinking, and realtime streaming |
| gemini-2.0-flash-lite | 1M tokens | Cost efficiency and low latency |
| gemini-1.5-flash | 1M tokens | Balanced multimodal model, good for most tasks |
| gemini-1.5-flash-8B | 1M tokens | Fastest, most cost-efficient, good for high-frequency tasks |
| gemini-1.5-pro | 2M tokens | Best performing, wide variety of reasoning tasks including logical reasoning, coding, and creative collaboration |

Azure

CrewAI provides native integration with Azure AI Inference and Azure OpenAI through the Azure AI Inference Python SDK.
Code
# Required
AZURE_API_KEY=<your-api-key>
AZURE_ENDPOINT=<your-endpoint-url>

# Optional
AZURE_API_VERSION=<api-version>  # Defaults to 2024-06-01
Endpoint URL Formats:
For Azure OpenAI deployments:
https://<resource-name>.openai.azure.com/openai/deployments/<deployment-name>
For Azure AI Inference endpoints:
https://<resource-name>.inference.azure.com
Basic Usage:
Code
from crewai import LLM

llm = LLM(
    model="azure/gpt-4",
    api_key="<your-api-key>",  # Or set AZURE_API_KEY
    endpoint="<your-endpoint-url>",
    api_version="2024-06-01"
)
Advanced Configuration:
Code
from crewai import LLM

llm = LLM(
    model="azure/gpt-4o",
    temperature=0.7,
    max_tokens=4000,
    top_p=0.9,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    stop=["END"],
    stream=True,
    timeout=60.0,
    max_retries=3
)
Supported Environment Variables:
  • AZURE_API_KEY: Your Azure API key (required)
  • AZURE_ENDPOINT: Your Azure endpoint URL (required, also checks AZURE_OPENAI_ENDPOINT and AZURE_API_BASE)
  • AZURE_API_VERSION: API version (optional, defaults to 2024-06-01)
Features:
  • Native function calling support for Azure OpenAI models (gpt-4, gpt-4o, gpt-3.5-turbo, etc.)
  • Streaming support for real-time responses
  • Automatic endpoint URL validation and correction
  • Comprehensive error handling with retry logic
  • Token usage tracking
Note: To use Azure AI Inference, install the required dependencies:
uv add "crewai[azure-ai-inference]"

AWS Bedrock

CrewAI provides native integration with AWS Bedrock through the boto3 SDK using the Converse API.
Code
# Required
AWS_ACCESS_KEY_ID=<your-access-key>
AWS_SECRET_ACCESS_KEY=<your-secret-key>

# Optional
AWS_SESSION_TOKEN=<your-session-token>  # For temporary credentials
AWS_DEFAULT_REGION=<your-region>  # Defaults to us-east-1
Basic Usage:
Code
from crewai import LLM

llm = LLM(
    model="bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0",
    region_name="us-east-1"
)
Advanced Configuration:
Code
from crewai import LLM

llm = LLM(
    model="bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0",
    aws_access_key_id="your-access-key",  # Or set AWS_ACCESS_KEY_ID
    aws_secret_access_key="your-secret-key",  # Or set AWS_SECRET_ACCESS_KEY
    aws_session_token="your-session-token",  # For temporary credentials
    region_name="us-east-1",
    temperature=0.7,
    max_tokens=4096,
    top_p=0.9,
    top_k=250,  # For Claude models
    stop_sequences=["END", "STOP"],
    stream=True,  # Enable streaming
    guardrail_config={  # Optional content filtering
        "guardrailIdentifier": "your-guardrail-id",
        "guardrailVersion": "1"
    },
    additional_model_request_fields={  # Model-specific parameters
        "top_k": 250
    }
)
Supported Environment Variables:
  • AWS_ACCESS_KEY_ID: AWS access key (required)
  • AWS_SECRET_ACCESS_KEY: AWS secret key (required)
  • AWS_SESSION_TOKEN: AWS session token for temporary credentials (optional)
  • AWS_DEFAULT_REGION: AWS region (defaults to us-east-1)
Features:
  • Native tool calling support via Converse API
  • Streaming and non-streaming responses
  • Comprehensive error handling with retry logic
  • Guardrail configuration for content filtering
  • Model-specific parameters via additional_model_request_fields
  • Token usage tracking and stop reason logging
  • Support for all Bedrock foundation models
  • Automatic conversation format handling
Important Notes:
  • Uses the modern Converse API for unified model access
  • Automatic handling of model-specific conversation requirements
  • System messages are handled separately from conversation
  • First message must be from user (automatically handled)
  • Some models (like Cohere) require conversation to end with user message
Amazon Bedrock is a managed service that provides access to multiple foundation models from top AI companies through a unified API.
| Model | Context Window | Best For |
|---|---|---|
| Amazon Nova Pro | Up to 300k tokens | High-performance model balancing accuracy, speed, and cost-effectiveness across diverse tasks. |
| Amazon Nova Micro | Up to 128k tokens | High-performance, cost-effective text-only model optimized for lowest-latency responses. |
| Amazon Nova Lite | Up to 300k tokens | High-performance, affordable multimodal processing for images, video, and text with real-time capabilities. |
| Claude 3.7 Sonnet | Up to 128k tokens | High performance; best for complex reasoning, coding & AI agents |
| Claude 3.5 Sonnet v2 | Up to 200k tokens | State-of-the-art model specialized in software engineering, agentic capabilities, and computer interaction at optimized cost. |
| Claude 3.5 Sonnet | Up to 200k tokens | High-performance model delivering superior intelligence and reasoning across diverse tasks with optimal speed-cost balance. |
| Claude 3.5 Haiku | Up to 200k tokens | Fast, compact multimodal model optimized for quick responses and seamless human-like interactions |
| Claude 3 Sonnet | Up to 200k tokens | Multimodal model balancing intelligence and speed for high-volume deployments. |
| Claude 3 Haiku | Up to 200k tokens | Compact, high-speed multimodal model optimized for quick responses and natural conversational interactions |
| Claude 3 Opus | Up to 200k tokens | Most advanced multimodal model excelling at complex tasks with human-like reasoning and superior contextual understanding. |
| Claude 2.1 | Up to 200k tokens | Enhanced version with expanded context window, improved reliability, and reduced hallucinations for long-form and RAG applications |
| Claude | Up to 100k tokens | Versatile model excelling in sophisticated dialogue, creative content, and precise instruction following. |
| Claude Instant | Up to 100k tokens | Fast, cost-effective model for everyday tasks like dialogue, analysis, summarization, and document Q&A |
| Llama 3.1 405B Instruct | Up to 128k tokens | Advanced LLM for synthetic data generation, distillation, and inference for chatbots, coding, and domain-specific tasks. |
| Llama 3.1 70B Instruct | Up to 128k tokens | Powers complex conversations with superior contextual understanding, reasoning, and text generation. |
| Llama 3.1 8B Instruct | Up to 128k tokens | Advanced state-of-the-art model with language understanding, superior reasoning, and text generation. |
| Llama 3 70B Instruct | Up to 8k tokens | Powers complex conversations with superior contextual understanding, reasoning, and text generation. |
| Llama 3 8B Instruct | Up to 8k tokens | Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation. |
| Titan Text G1 - Lite | Up to 4k tokens | Lightweight, cost-effective model optimized for English tasks and fine-tuning, with a focus on summarization and content generation. |
| Titan Text G1 - Express | Up to 8k tokens | Versatile model for general language tasks, chat, and RAG applications, with support for English and 100+ languages. |
| Cohere Command | Up to 4k tokens | Model specialized in following user commands and delivering practical enterprise solutions. |
| Jurassic-2 Mid | Up to 8,191 tokens | Cost-effective model balancing quality and affordability for diverse language tasks like Q&A, summarization, and content generation. |
| Jurassic-2 Ultra | Up to 8,191 tokens | Model for advanced text generation and comprehension, excelling in complex tasks like analysis and content creation. |
| Jamba-Instruct | Up to 256k tokens | Model with extended context window optimized for cost-effective text generation, summarization, and Q&A. |
| Mistral 7B Instruct | Up to 32k tokens | This LLM follows instructions, completes requests, and generates creative text. |
| Mixtral 8x7B Instruct | Up to 32k tokens | An MoE LLM that follows instructions, completes requests, and generates creative text. |
| DeepSeek R1 | 32,768 tokens | Advanced reasoning model |
Note: To use AWS Bedrock, install the required dependencies:
uv add "crewai[bedrock]"

Amazon SageMaker

Set the following environment variables in your .env file:
Code
AWS_ACCESS_KEY_ID=<your-access-key>
AWS_SECRET_ACCESS_KEY=<your-secret-key>
AWS_DEFAULT_REGION=<your-region>
Example usage in your CrewAI project:
Code
from crewai import LLM

llm = LLM(
    model="sagemaker/<my-endpoint>"
)

Mistral

Set the following environment variables in your .env file:
Code
MISTRAL_API_KEY=<your-api-key>
Example usage in your CrewAI project:
Code
from crewai import LLM

llm = LLM(
    model="mistral/mistral-large-latest",
    temperature=0.7
)

NVIDIA NIM

Set the following environment variables in your .env file:
Code
NVIDIA_API_KEY=<your-api-key>
Example usage in your CrewAI project:
Code
from crewai import LLM

llm = LLM(
    model="nvidia_nim/meta/llama3-70b-instruct",
    temperature=0.7
)
Nvidia NIM provides a comprehensive suite of models for various use cases, from general-purpose tasks to specialized applications.
| Model | Context Window | Best For |
|---|---|---|
| nvidia/mistral-nemo-minitron-8b-8k-instruct | 8,192 tokens | State-of-the-art small language model delivering superior accuracy for chatbots, virtual assistants, and content generation. |
| nvidia/nemotron-4-mini-hindi-4b-instruct | 4,096 tokens | A bilingual Hindi-English SLM for on-device inference, tailored specifically for the Hindi language. |
| nvidia/llama-3.1-nemotron-70b-instruct | 128k tokens | Customized for enhanced helpfulness in responses |
| nvidia/llama3-chatqa-1.5-8b | 128k tokens | Advanced LLM to generate high-quality, context-aware responses for chatbots and search engines. |
| nvidia/llama3-chatqa-1.5-70b | 128k tokens | Advanced LLM to generate high-quality, context-aware responses for chatbots and search engines. |
| nvidia/vila | 128k tokens | Multi-modal vision-language model that understands text/images/video and creates informative responses |
| nvidia/neva-22 | 4,096 tokens | Multi-modal vision-language model that understands text/images and generates informative responses |
| nvidia/nemotron-mini-4b-instruct | 8,192 tokens | General-purpose tasks |
| nvidia/usdcode-llama3-70b-instruct | 128k tokens | State-of-the-art LLM that answers OpenUSD knowledge queries and generates USD-Python code. |
| nvidia/nemotron-4-340b-instruct | 4,096 tokens | Creates diverse synthetic data that mimics the characteristics of real-world data. |
| meta/codellama-70b | 100k tokens | LLM capable of generating code from natural language and vice versa. |
| meta/llama2-70b | 4,096 tokens | Cutting-edge large language AI model capable of generating text and code in response to prompts. |
| meta/llama3-8b-instruct | 8,192 tokens | Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation. |
| meta/llama3-70b-instruct | 8,192 tokens | Powers complex conversations with superior contextual understanding, reasoning, and text generation. |
| meta/llama-3.1-8b-instruct | 128k tokens | Advanced state-of-the-art model with language understanding, superior reasoning, and text generation. |
| meta/llama-3.1-70b-instruct | 128k tokens | Powers complex conversations with superior contextual understanding, reasoning, and text generation. |
| meta/llama-3.1-405b-instruct | 128k tokens | Advanced LLM for synthetic data generation, distillation, and inference for chatbots, coding, and domain-specific tasks. |
| meta/llama-3.2-1b-instruct | 128k tokens | Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation. |
| meta/llama-3.2-3b-instruct | 128k tokens | Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation. |
| meta/llama-3.2-11b-vision-instruct | 128k tokens | Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation. |
| meta/llama-3.2-90b-vision-instruct | 128k tokens | Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation. |
| google/gemma-7b | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
| google/gemma-2b | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
| google/codegemma-7b | 8,192 tokens | Cutting-edge model built on Google’s Gemma-7B, specialized for code generation and code completion. |
| google/codegemma-1.1-7b | 8,192 tokens | Advanced programming model for code generation, completion, reasoning, and instruction following. |
| google/recurrentgemma-2b | 8,192 tokens | Novel recurrent-architecture language model for faster inference when generating long sequences. |
| google/gemma-2-9b-it | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
| google/gemma-2-27b-it | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
| google/gemma-2-2b-it | 8,192 tokens | Cutting-edge text generation model for text understanding, transformation, and code generation. |
| google/deplot | 512 tokens | One-shot visual language understanding model that translates images of plots into tables. |
| google/paligemma | 8,192 tokens | Vision-language model adept at comprehending text and visual inputs to produce informative responses. |
| mistralai/mistral-7b-instruct-v0.2 | 32k tokens | This LLM follows instructions, completes requests, and generates creative text. |
| mistralai/mixtral-8x7b-instruct-v0.1 | 8,192 tokens | An MoE LLM that follows instructions, completes requests, and generates creative text. |
| mistralai/mistral-large | 4,096 tokens | Creates diverse synthetic data that mimics the characteristics of real-world data. |
| mistralai/mixtral-8x22b-instruct-v0.1 | 8,192 tokens | Creates diverse synthetic data that mimics the characteristics of real-world data. |
| mistralai/mistral-7b-instruct-v0.3 | 32k tokens | This LLM follows instructions, completes requests, and generates creative text. |
| nv-mistralai/mistral-nemo-12b-instruct | 128k tokens | Most advanced language model for reasoning, code, and multilingual tasks; runs on a single GPU. |
| mistralai/mamba-codestral-7b-v0.1 | 256k tokens | Model for writing and interacting with code across a wide range of programming languages and tasks. |
| microsoft/phi-3-mini-128k-instruct | 128K tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
| microsoft/phi-3-mini-4k-instruct | 4,096 tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
| microsoft/phi-3-small-8k-instruct | 8,192 tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
| microsoft/phi-3-small-128k-instruct | 128K tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
| microsoft/phi-3-medium-4k-instruct | 4,096 tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
| microsoft/phi-3-medium-128k-instruct | 128K tokens | Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills. |
| microsoft/phi-3.5-mini-instruct | 128K tokens | Lightweight multilingual LLM powering AI applications in latency-bound, memory/compute-constrained environments |
| microsoft/phi-3.5-moe-instruct | 128K tokens | Advanced LLM based on a Mixture of Experts architecture to deliver compute-efficient content generation |
| microsoft/kosmos-2 | 1,024 tokens | Groundbreaking multimodal model designed to understand and reason about visual elements in images. |
| microsoft/phi-3-vision-128k-instruct | 128k tokens | Cutting-edge open multimodal model excelling in high-quality reasoning from images. |
| microsoft/phi-3.5-vision-instruct | 128k tokens | Cutting-edge open multimodal model excelling in high-quality reasoning from images. |
| databricks/dbrx-instruct | 12k tokens | A general-purpose LLM with state-of-the-art performance in language understanding, coding, and RAG. |
| snowflake/arctic | 1,024 tokens | Delivers high-efficiency inference for enterprise applications focused on SQL generation and coding. |
| aisingapore/sea-lion-7b-instruct | 4,096 tokens | LLM to represent and serve the linguistic and cultural diversity of Southeast Asia |
| ibm/granite-8b-code-instruct | 4,096 tokens | Software programming LLM for code generation, completion, explanation, and multi-turn conversion. |
| ibm/granite-34b-code-instruct | 8,192 tokens | Software programming LLM for code generation, completion, explanation, and multi-turn conversion. |
| ibm/granite-3.0-8b-instruct | 4,096 tokens | Advanced small language model supporting RAG, summarization, classification, code, and agentic AI |
| ibm/granite-3.0-3b-a800m-instruct | 4,096 tokens | Highly efficient Mixture of Experts model for RAG, summarization, entity extraction, and classification |
| mediatek/breeze-7b-instruct | 4,096 tokens | Creates diverse synthetic data that mimics the characteristics of real-world data. |
| upstage/solar-10.7b-instruct | 4,096 tokens | Excels in NLP tasks, particularly in instruction-following, reasoning, and mathematics. |
| writer/palmyra-med-70b-32k | 32k tokens | Leading LLM for accurate, contextually relevant responses in the medical domain. |
| writer/palmyra-med-70b | 32k tokens | Leading LLM for accurate, contextually relevant responses in the medical domain. |
| writer/palmyra-fin-70b-32k | 32k tokens | Specialized LLM for financial analysis, reporting, and data processing |
| 01-ai/yi-large | 32k tokens | Powerful model trained on English and Chinese for diverse tasks including chatbots and creative writing. |
| deepseek-ai/deepseek-coder-6.7b-instruct | 2k tokens | Powerful coding model offering advanced capabilities in code generation, completion, and infilling |
| rakuten/rakutenai-7b-instruct | 1,024 tokens | Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation. |
| rakuten/rakutenai-7b-chat | 1,024 tokens | Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation. |
| baichuan-inc/baichuan2-13b-chat | 4,096 tokens | Supports Chinese and English chat, coding, math, instruction following, and quiz solving |
NVIDIA NIM also enables you to run powerful LLMs locally on your Windows machine using WSL2 (Windows Subsystem for Linux). This approach lets you leverage your NVIDIA GPU for private, secure, and cost-effective AI inference without relying on cloud services. It is perfect for development, testing, or production scenarios where data privacy or offline capabilities are required.

Here is a step-by-step guide to setting up a local NVIDIA NIM model:
  1. Follow the installation instructions from the NVIDIA website
  2. Install the local model (for Llama 3.1-8b, follow the instructions)
  3. Configure your local model in CrewAI:
Code
from crewai import Agent, LLM
from crewai.project import CrewBase, agent

local_nvidia_nim_llm = LLM(
    model="openai/meta/llama-3.1-8b-instruct",  # served through an OpenAI-API-compatible endpoint
    base_url="http://localhost:8000/v1",
    api_key="<your_api_key>",  # required by the client; any non-empty string works if you have not configured auth
)

# Then you can use it in your crew:

@CrewBase
class MyCrew():
    # ...

    @agent
    def researcher(self) -> Agent:
        return Agent(
            config=self.agents_config['researcher'], # type: ignore[index]
            llm=local_nvidia_nim_llm
        )

    # ...

Groq

Set the following environment variables in your .env file:
Code
GROQ_API_KEY=<your-api-key>
Example usage in your CrewAI project:
Code
from crewai import LLM

llm = LLM(
    model="groq/llama-3.2-90b-text-preview",
    temperature=0.7
)
| Model | Context Window | Best For |
|---|---|---|
| Llama 3.1 70B/8B | 131,072 tokens | High-performance, large context tasks |
| Llama 3.2 Series | 8,192 tokens | General-purpose tasks |
| Mixtral 8x7B | 32,768 tokens | Balanced performance and context |

IBM watsonx.ai

Set the following environment variables in your .env file:
Code
# Required
WATSONX_URL=<your-url>
WATSONX_APIKEY=<your-apikey>
WATSONX_PROJECT_ID=<your-project-id>

# Optional
WATSONX_TOKEN=<your-token>
WATSONX_DEPLOYMENT_SPACE_ID=<your-space-id>
Example usage in your CrewAI project:
Code
from crewai import LLM

llm = LLM(
    model="watsonx/meta-llama/llama-3-1-70b-instruct",
    base_url="https://api.watsonx.ai/v1"
)

Ollama

  1. Install Ollama: ollama.ai
  2. Run a model: ollama run llama3
  3. Configure:
Code
from crewai import LLM

llm = LLM(
    model="ollama/llama3:70b",
    base_url="http://localhost:11434"
)

Fireworks AI

Set the following environment variables in your .env file:
Code
FIREWORKS_API_KEY=<your-api-key>
Example usage in your CrewAI project:
Code
from crewai import LLM

llm = LLM(
    model="fireworks_ai/accounts/fireworks/models/llama-v3-70b-instruct",
    temperature=0.7
)

Perplexity AI

Set the following environment variables in your .env file:
Code
PERPLEXITY_API_KEY=<your-api-key>
Example usage in your CrewAI project:
Code
from crewai import LLM

llm = LLM(
    model="llama-3.1-sonar-large-128k-online",
    base_url="https://api.perplexity.ai/"
)

Hugging Face

Set the following environment variables in your .env file:
Code
HF_TOKEN=<your-api-key>
Example usage in your CrewAI project:
Code
from crewai import LLM

llm = LLM(
    model="huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct"
)

SambaNova

Set the following environment variables in your .env file:
Code
SAMBANOVA_API_KEY=<your-api-key>
Example usage in your CrewAI project:
Code
from crewai import LLM

llm = LLM(
    model="sambanova/Meta-Llama-3.1-8B-Instruct",
    temperature=0.7
)
| Model | Context Window | Best For |
|---|---|---|
| Llama 3.1 70B/8B | Up to 131,072 tokens | High-performance, large context tasks |
| Llama 3.1 405B | 8,192 tokens | High-performance and output quality |
| Llama 3.2 Series | 8,192 tokens | General-purpose, multimodal tasks |
| Llama 3.3 70B | Up to 131,072 tokens | High-performance and output quality |
| Qwen2 family | 8,192 tokens | High-performance and output quality |

Cerebras

Set the following environment variables in your .env file:
Code
# Required
CEREBRAS_API_KEY=<your-api-key>
Example usage in your CrewAI project:
Code
from crewai import LLM

llm = LLM(
    model="cerebras/llama3.1-70b",
    temperature=0.7,
    max_tokens=8192
)
Cerebras features:
  • Fast inference speeds
  • Competitive pricing
  • Good balance of speed and quality
  • Support for long context windows

Open Router

Set the following environment variables in your .env file:
Code
OPENROUTER_API_KEY=<your-api-key>
Example usage in your CrewAI project:
Code
import os

from crewai import LLM

llm = LLM(
    model="openrouter/deepseek/deepseek-r1",
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY")
)
Open Router models:
  • openrouter/deepseek/deepseek-r1
  • openrouter/deepseek/deepseek-chat

Nebius AI Studio

Set the following environment variables in your .env file:
Code
NEBIUS_API_KEY=<your-api-key>
Example usage in your CrewAI project:
Code
from crewai import LLM

llm = LLM(
    model="nebius/Qwen/Qwen3-30B-A3B"
)
Nebius AI Studio features:
  • Large collection of open source models
  • Higher rate limits
  • Competitive pricing
  • Good balance of speed and quality

Streaming Responses

CrewAI supports streaming responses from LLMs, allowing your application to receive and process outputs in real-time as they’re generated.
  • Basic Setup
  • Event Handling
  • Agent & Task Tracking
Enable streaming by setting the stream parameter to True when initializing your LLM:
from crewai import LLM

# Create an LLM with streaming enabled
llm = LLM(
    model="openai/gpt-4o",
    stream=True  # Enable streaming
)
When streaming is enabled, responses are delivered in chunks as they’re generated, creating a more responsive user experience.
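
If you also want to handle chunks as they arrive, recent CrewAI versions surface them through the event bus. A minimal sketch, assuming the BaseEventListener and LLMStreamChunkEvent APIs from crewai.utilities.events:

from crewai.utilities.events import LLMStreamChunkEvent
from crewai.utilities.events.base_event_listener import BaseEventListener

class ChunkPrinter(BaseEventListener):
    def setup_listeners(self, crewai_event_bus):
        @crewai_event_bus.on(LLMStreamChunkEvent)
        def on_chunk(source, event):
            # Print each streamed chunk as soon as it arrives
            print(event.chunk, end="", flush=True)

chunk_printer = ChunkPrinter()  # instantiating registers the listeners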

Structured LLM Calls

CrewAI supports structured responses from LLM calls by allowing you to define a response_format using a Pydantic model. This enables the framework to automatically parse and validate the output, making it easier to integrate the response into your application without manual post-processing. For example, you can define a Pydantic model to represent the expected response structure and pass it as the response_format when instantiating the LLM. The model will then be used to convert the LLM output into a structured Python object.
Code
from crewai import LLM
from pydantic import BaseModel

class Dog(BaseModel):
    name: str
    age: int
    breed: str


llm = LLM(model="gpt-4o", response_format=Dog)

response = llm.call(
    "Analyze the following messages and return the name, age, and breed. "
    "Meet Kona! She is 3 years old and is a black german shepherd."
)
print(response)

# Output:
# Dog(name='Kona', age=3, breed='black german shepherd')

Advanced Features and Optimization

Learn how to get the most out of your LLM configuration:
CrewAI includes smart context management features:
from crewai import LLM

# CrewAI automatically handles:
# 1. Token counting and tracking
# 2. Content summarization when needed
# 3. Task splitting for large contexts

llm = LLM(
    model="gpt-4",
    max_tokens=4000,  # Limit response length
)
Best practices for context management:
  1. Choose models with appropriate context windows
  2. Pre-process long inputs when possible
  3. Use chunking for large documents (see the sketch below)
  4. Monitor token usage to optimize costs
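
A minimal chunking sketch for point 3 (plain Python, no CrewAI-specific API assumed): split a long document into overlapping pieces that each fit comfortably within the model’s context window.

def chunk_text(text: str, chunk_size: int = 8000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size characters."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # the overlap preserves context across boundaries
    return chunks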

1. Token Usage Optimization

Choose the right context window for your task:
  • Small tasks (up to 4K tokens): Standard models
  • Medium tasks (4K to 32K tokens): Enhanced models
  • Large tasks (over 32K tokens): Large context models
from crewai import LLM

# Configure model with appropriate settings
llm = LLM(
    model="openai/gpt-4-turbo-preview",
    temperature=0.7,    # Adjust based on task
    max_tokens=4096,    # Set based on output needs
    timeout=300        # Longer timeout for complex tasks
)
  • Lower temperature (0.1 to 0.3) for factual responses
  • Higher temperature (0.7 to 0.9) for creative tasks

2. Best Practices

  1. Monitor token usage
  2. Implement rate limiting
  3. Use caching when possible
  4. Set appropriate max_tokens limits
Remember to regularly monitor your token usage and adjust your configuration as needed to optimize costs and performance.
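
One way to monitor usage is through the metrics CrewAI records after a run. A minimal sketch, assuming a configured crew object and the usage_metrics attribute populated by kickoff():

result = crew.kickoff()
print(crew.usage_metrics)  # aggregated prompt/completion token counts for the run
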
CrewAI internally uses native SDKs for LLM calls, which allows you to drop additional parameters that are not needed (or not supported) for your specific use case. This can help simplify your code and reduce the complexity of your LLM configuration. For example, if you don’t need to send the stop parameter, you can simply omit it from your LLM call:
from crewai import LLM
import os

os.environ["OPENAI_API_KEY"] = "<api-key>"

o3_llm = LLM(
    model="o3",
    drop_params=True,
    additional_drop_params=["stop"]
)

Common Issues and Solutions

  • Authentication
  • Model Names
  • Context Length
Most authentication issues can be resolved by checking API key format and environment variable names.
# OpenAI
OPENAI_API_KEY=sk-...

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...
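
For model-name issues, check that the model id carries the provider prefix used throughout this guide. A few examples drawn from the sections above:

from crewai import LLM

# Model ids are prefixed with the provider
llm = LLM(model="openai/gpt-4o")
llm = LLM(model="anthropic/claude-3-5-sonnet-20241022")
llm = LLM(model="gemini/gemini-2.0-flash")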