LLMs

Overview

CrewAI integrates with multiple LLM providers through LiteLLM, giving you the flexibility to choose the right model for your specific use case. This guide will help you understand how to configure and use different LLM providers in your CrewAI projects.

What are LLMs?

Large Language Models (LLMs) are the core intelligence behind CrewAI agents. They enable agents to understand context, make decisions, and generate human-like responses. Here’s what you need to know:

LLM Basics

Large Language Models are AI systems trained on vast amounts of text data. They power the intelligence of your CrewAI agents, enabling them to understand and generate human-like text.

Context Window

The context window determines how much text an LLM can process at once. Larger windows (e.g., 128K tokens) allow for more context but may be more expensive and slower.

Temperature

Temperature (0.0 to 1.0) controls response randomness. Lower values (e.g., 0.2) produce more focused, deterministic outputs, while higher values (e.g., 0.8) increase creativity and variability.

Provider Selection

Each LLM provider (e.g., OpenAI, Anthropic, Google) offers different models with varying capabilities, pricing, and features. Choose based on your needs for accuracy, speed, and cost.

Setting up your LLM

There are different places in CrewAI code where you can specify the model to use. Once you specify the model you are using, you will need to provide the configuration (like an API key) for each of the model providers you use. See the provider configuration examples section for your provider.

The simplest way to get started. Set the model in your environment directly, through an .env file or in your app code. If you used crewai create to bootstrap your project, it will be set already.

.env

MODEL=model-id  # e.g. gpt-4o, gemini-2.0-flash, claude-3-sonnet-...

# Be sure to set your API keys here too. See the Provider
# section below.

Never commit API keys to version control. Use environment files (.env) or your system’s secret management.

The simplest way to get started. Set the model in your environment directly, through an .env file or in your app code. If you used crewai create to bootstrap your project, it will be set already.

.env

MODEL=model-id  # e.g. gpt-4o, gemini-2.0-flash, claude-3-sonnet-...

# Be sure to set your API keys here too. See the Provider
# section below.

Never commit API keys to version control. Use environment files (.env) or your system’s secret management.

Create a YAML file to define your agent configurations. This method is great for version control and team collaboration:

agents.yaml

researcher:
    role: Research Specialist
    goal: Conduct comprehensive research and analysis
    backstory: A dedicated research professional with years of experience
    verbose: true
    llm: provider/model-id  # e.g. openai/gpt-4o, google/gemini-2.0-flash, anthropic/claude...
    # (see provider configuration examples below for more)

The YAML configuration allows you to:

Version control your agent settings
Easily switch between different models
Share configurations across team members
Document model choices and their purposes

For maximum flexibility, configure LLMs directly in your Python code:

from crewai import LLM

# Basic configuration
llm = LLM(model="model-id-here")  # gpt-4o, gemini-2.0-flash, anthropic/claude...

# Advanced configuration with detailed parameters
llm = LLM(
    model="model-id-here",  # gpt-4o, gemini-2.0-flash, anthropic/claude...
    temperature=0.7,        # Higher for more creative outputs
    timeout=120,            # Seconds to wait for response
    max_tokens=4000,        # Maximum length of response
    top_p=0.9,              # Nucleus sampling parameter
    frequency_penalty=0.1 , # Reduce repetition
    presence_penalty=0.1,   # Encourage topic diversity
    response_format={"type": "json"},  # For structured outputs
    seed=42                 # For reproducible results
)

Parameter explanations:

temperature: Controls randomness (0.0-1.0)
timeout: Maximum wait time for response
max_tokens: Limits response length
top_p: Alternative to temperature for sampling
frequency_penalty: Reduces word repetition
presence_penalty: Encourages new topics
response_format: Specifies output structure
seed: Ensures consistent outputs

Provider Configuration Examples

CrewAI supports a multitude of LLM providers, each offering unique features, authentication methods, and model capabilities. In this section, you’ll find detailed examples that help you select, configure, and optimize the LLM that best fits your project’s needs.

OpenAI

Set the following environment variables in your .env file:

Code

# Required
OPENAI_API_KEY=sk-...

# Optional
OPENAI_API_BASE=<custom-base-url>
OPENAI_ORGANIZATION=<your-org-id>

Example usage in your CrewAI project:

Code

from crewai import LLM

llm = LLM(
    model="openai/gpt-4", # call model by provider/model_name
    temperature=0.8,
    max_tokens=150,
    top_p=0.9,
    frequency_penalty=0.1,
    presence_penalty=0.1,
    stop=["END"],
    seed=42
)

OpenAI is one of the leading providers of LLMs with a wide range of models and features.

Model	Context Window	Best For
GPT-4	8,192 tokens	High-accuracy tasks, complex reasoning
GPT-4 Turbo	128,000 tokens	Long-form content, document analysis
GPT-4o & GPT-4o-mini	128,000 tokens	Cost-effective large context processing
o3-mini	200,000 tokens	Fast reasoning, complex reasoning
o1-mini	128,000 tokens	Fast reasoning, complex reasoning
o1-preview	128,000 tokens	Fast reasoning, complex reasoning
o1	200,000 tokens	Fast reasoning, complex reasoning

Meta-Llama

Meta’s Llama API provides access to Meta’s family of large language models. The API is available through the Meta Llama API. Set the following environment variables in your .env file:

Code

# Meta Llama API Key Configuration
LLAMA_API_KEY=LLM|your_api_key_here

Example usage in your CrewAI project:

Code

from crewai import LLM

# Initialize Meta Llama LLM
llm = LLM(
    model="meta_llama/Llama-4-Scout-17B-16E-Instruct-FP8",
    temperature=0.8,
    stop=["END"],
    seed=42
)

All models listed here https://llama.developer.meta.com/docs/models/ are supported.

Model ID	Input context length	Output context length	Input Modalities	Output Modalities
`meta_llama/Llama-4-Scout-17B-16E-Instruct-FP8`	128k	4028	Text, Image	Text
`meta_llama/Llama-4-Maverick-17B-128E-Instruct-FP8`	128k	4028	Text, Image	Text
`meta_llama/Llama-3.3-70B-Instruct`	128k	4028	Text	Text
`meta_llama/Llama-3.3-8B-Instruct`	128k	4028	Text	Text

Anthropic

Code

# Required
ANTHROPIC_API_KEY=sk-ant-...

# Optional
ANTHROPIC_API_BASE=<custom-base-url>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="anthropic/claude-3-sonnet-20240229-v1:0",
    temperature=0.7
)

Google (Gemini API)

Set your API key in your .env file. If you need a key, or need to find an existing key, check AI Studio.

.env

# https://ai.google.dev/gemini-api/docs/api-key
GEMINI_API_KEY=<your-api-key>

Example usage in your CrewAI project:

Code

from crewai import LLM

llm = LLM(
    model="gemini/gemini-2.0-flash",
    temperature=0.7,
)

Gemini models

Google offers a range of powerful models optimized for different use cases.

Model	Context Window	Best For
gemini-2.5-flash-preview-04-17	1M tokens	Adaptive thinking, cost efficiency
gemini-2.5-pro-preview-05-06	1M tokens	Enhanced thinking and reasoning, multimodal understanding, advanced coding, and more
gemini-2.0-flash	1M tokens	Next generation features, speed, thinking, and realtime streaming
gemini-2.0-flash-lite	1M tokens	Cost efficiency and low latency
gemini-1.5-flash	1M tokens	Balanced multimodal model, good for most tasks
gemini-1.5-flash-8B	1M tokens	Fastest, most cost-efficient, good for high-frequency tasks
gemini-1.5-pro	2M tokens	Best performing, wide variety of reasoning tasks including logical reasoning, coding, and creative collaboration

The full list of models is available in the Gemini model docs.

Gemma

The Gemini API also allows you to use your API key to access Gemma models hosted on Google infrastructure.

Model	Context Window
gemma-3-1b-it	32k tokens
gemma-3-4b-it	32k tokens
gemma-3-12b-it	32k tokens
gemma-3-27b-it	128k tokens

Google (Vertex AI)

Get credentials from your Google Cloud Console and save it to a JSON file, then load it with the following code:

Code

import json

file_path = 'path/to/vertex_ai_service_account.json'

# Load the JSON file
with open(file_path, 'r') as file:
    vertex_credentials = json.load(file)

# Convert the credentials to a JSON string
vertex_credentials_json = json.dumps(vertex_credentials)

Example usage in your CrewAI project:

Code

from crewai import LLM

llm = LLM(
    model="gemini/gemini-1.5-pro-latest",
    temperature=0.7,
    vertex_credentials=vertex_credentials_json
)

Google offers a range of powerful models optimized for different use cases:

Model	Context Window	Best For
gemini-2.5-flash-preview-04-17	1M tokens	Adaptive thinking, cost efficiency
gemini-2.5-pro-preview-05-06	1M tokens	Enhanced thinking and reasoning, multimodal understanding, advanced coding, and more
gemini-2.0-flash	1M tokens	Next generation features, speed, thinking, and realtime streaming
gemini-2.0-flash-lite	1M tokens	Cost efficiency and low latency
gemini-1.5-flash	1M tokens	Balanced multimodal model, good for most tasks
gemini-1.5-flash-8B	1M tokens	Fastest, most cost-efficient, good for high-frequency tasks
gemini-1.5-pro	2M tokens	Best performing, wide variety of reasoning tasks including logical reasoning, coding, and creative collaboration

Azure

Code

# Required
AZURE_API_KEY=<your-api-key>
AZURE_API_BASE=<your-resource-url>
AZURE_API_VERSION=<api-version>

# Optional
AZURE_AD_TOKEN=<your-azure-ad-token>
AZURE_API_TYPE=<your-azure-api-type>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="azure/gpt-4",
    api_version="2023-05-15"
)

AWS Bedrock

Code

AWS_ACCESS_KEY_ID=<your-access-key>
AWS_SECRET_ACCESS_KEY=<your-secret-key>
AWS_DEFAULT_REGION=<your-region>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0"
)

Before using Amazon Bedrock, make sure you have boto3 installed in your environment

Amazon Bedrock is a managed service that provides access to multiple foundation models from top AI companies through a unified API, enabling secure and responsible AI application development.

Model	Context Window	Best For
Amazon Nova Pro	Up to 300k tokens	High-performance, model balancing accuracy, speed, and cost-effectiveness across diverse tasks.
Amazon Nova Micro	Up to 128k tokens	High-performance, cost-effective text-only model optimized for lowest latency responses.
Amazon Nova Lite	Up to 300k tokens	High-performance, affordable multimodal processing for images, video, and text with real-time capabilities.
Claude 3.7 Sonnet	Up to 128k tokens	High-performance, best for complex reasoning, coding & AI agents
Claude 3.5 Sonnet v2	Up to 200k tokens	State-of-the-art model specialized in software engineering, agentic capabilities, and computer interaction at optimized cost.
Claude 3.5 Sonnet	Up to 200k tokens	High-performance model delivering superior intelligence and reasoning across diverse tasks with optimal speed-cost balance.
Claude 3.5 Haiku	Up to 200k tokens	Fast, compact multimodal model optimized for quick responses and seamless human-like interactions
Claude 3 Sonnet	Up to 200k tokens	Multimodal model balancing intelligence and speed for high-volume deployments.
Claude 3 Haiku	Up to 200k tokens	Compact, high-speed multimodal model optimized for quick responses and natural conversational interactions
Claude 3 Opus	Up to 200k tokens	Most advanced multimodal model exceling at complex tasks with human-like reasoning and superior contextual understanding.
Claude 2.1	Up to 200k tokens	Enhanced version with expanded context window, improved reliability, and reduced hallucinations for long-form and RAG applications
Claude	Up to 100k tokens	Versatile model excelling in sophisticated dialogue, creative content, and precise instruction following.
Claude Instant	Up to 100k tokens	Fast, cost-effective model for everyday tasks like dialogue, analysis, summarization, and document Q&A
Llama 3.1 405B Instruct	Up to 128k tokens	Advanced LLM for synthetic data generation, distillation, and inference for chatbots, coding, and domain-specific tasks.
Llama 3.1 70B Instruct	Up to 128k tokens	Powers complex conversations with superior contextual understanding, reasoning and text generation.
Llama 3.1 8B Instruct	Up to 128k tokens	Advanced state-of-the-art model with language understanding, superior reasoning, and text generation.
Llama 3 70B Instruct	Up to 8k tokens	Powers complex conversations with superior contextual understanding, reasoning and text generation.
Llama 3 8B Instruct	Up to 8k tokens	Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.
Titan Text G1 - Lite	Up to 4k tokens	Lightweight, cost-effective model optimized for English tasks and fine-tuning with focus on summarization and content generation.
Titan Text G1 - Express	Up to 8k tokens	Versatile model for general language tasks, chat, and RAG applications with support for English and 100+ languages.
Cohere Command	Up to 4k tokens	Model specialized in following user commands and delivering practical enterprise solutions.
Jurassic-2 Mid	Up to 8,191 tokens	Cost-effective model balancing quality and affordability for diverse language tasks like Q&A, summarization, and content generation.
Jurassic-2 Ultra	Up to 8,191 tokens	Model for advanced text generation and comprehension, excelling in complex tasks like analysis and content creation.
Jamba-Instruct	Up to 256k tokens	Model with extended context window optimized for cost-effective text generation, summarization, and Q&A.
Mistral 7B Instruct	Up to 32k tokens	This LLM follows instructions, completes requests, and generates creative text.
Mistral 8x7B Instruct	Up to 32k tokens	An MOE LLM that follows instructions, completes requests, and generates creative text.

Amazon SageMaker

Code

AWS_ACCESS_KEY_ID=<your-access-key>
AWS_SECRET_ACCESS_KEY=<your-secret-key>
AWS_DEFAULT_REGION=<your-region>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="sagemaker/<my-endpoint>"
)

Mistral

Set the following environment variables in your .env file:

Code

MISTRAL_API_KEY=<your-api-key>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="mistral/mistral-large-latest",
    temperature=0.7
)

Nvidia NIM

Set the following environment variables in your .env file:

Code

NVIDIA_API_KEY=<your-api-key>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="nvidia_nim/meta/llama3-70b-instruct",
    temperature=0.7
)

Nvidia NIM provides a comprehensive suite of models for various use cases, from general-purpose tasks to specialized applications.

Model	Context Window	Best For
nvidia/mistral-nemo-minitron-8b-8k-instruct	8,192 tokens	State-of-the-art small language model delivering superior accuracy for chatbot, virtual assistants, and content generation.
nvidia/nemotron-4-mini-hindi-4b-instruct	4,096 tokens	A bilingual Hindi-English SLM for on-device inference, tailored specifically for Hindi Language.
nvidia/llama-3.1-nemotron-70b-instruct	128k tokens	Customized for enhanced helpfulness in responses
nvidia/llama3-chatqa-1.5-8b	128k tokens	Advanced LLM to generate high-quality, context-aware responses for chatbots and search engines.
nvidia/llama3-chatqa-1.5-70b	128k tokens	Advanced LLM to generate high-quality, context-aware responses for chatbots and search engines.
nvidia/vila	128k tokens	Multi-modal vision-language model that understands text/img/video and creates informative responses
nvidia/neva-22	4,096 tokens	Multi-modal vision-language model that understands text/images and generates informative responses
nvidia/nemotron-mini-4b-instruct	8,192 tokens	General-purpose tasks
nvidia/usdcode-llama3-70b-instruct	128k tokens	State-of-the-art LLM that answers OpenUSD knowledge queries and generates USD-Python code.
nvidia/nemotron-4-340b-instruct	4,096 tokens	Creates diverse synthetic data that mimics the characteristics of real-world data.
meta/codellama-70b	100k tokens	LLM capable of generating code from natural language and vice versa.
meta/llama2-70b	4,096 tokens	Cutting-edge large language AI model capable of generating text and code in response to prompts.
meta/llama3-8b-instruct	8,192 tokens	Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.
meta/llama3-70b-instruct	8,192 tokens	Powers complex conversations with superior contextual understanding, reasoning and text generation.
meta/llama-3.1-8b-instruct	128k tokens	Advanced state-of-the-art model with language understanding, superior reasoning, and text generation.
meta/llama-3.1-70b-instruct	128k tokens	Powers complex conversations with superior contextual understanding, reasoning and text generation.
meta/llama-3.1-405b-instruct	128k tokens	Advanced LLM for synthetic data generation, distillation, and inference for chatbots, coding, and domain-specific tasks.
meta/llama-3.2-1b-instruct	128k tokens	Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation.
meta/llama-3.2-3b-instruct	128k tokens	Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation.
meta/llama-3.2-11b-vision-instruct	128k tokens	Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation.
meta/llama-3.2-90b-vision-instruct	128k tokens	Advanced state-of-the-art small language model with language understanding, superior reasoning, and text generation.
google/gemma-7b	8,192 tokens	Cutting-edge text generation model text understanding, transformation, and code generation.
google/gemma-2b	8,192 tokens	Cutting-edge text generation model text understanding, transformation, and code generation.
google/codegemma-7b	8,192 tokens	Cutting-edge model built on Google’s Gemma-7B specialized for code generation and code completion.
google/codegemma-1.1-7b	8,192 tokens	Advanced programming model for code generation, completion, reasoning, and instruction following.
google/recurrentgemma-2b	8,192 tokens	Novel recurrent architecture based language model for faster inference when generating long sequences.
google/gemma-2-9b-it	8,192 tokens	Cutting-edge text generation model text understanding, transformation, and code generation.
google/gemma-2-27b-it	8,192 tokens	Cutting-edge text generation model text understanding, transformation, and code generation.
google/gemma-2-2b-it	8,192 tokens	Cutting-edge text generation model text understanding, transformation, and code generation.
google/deplot	512 tokens	One-shot visual language understanding model that translates images of plots into tables.
google/paligemma	8,192 tokens	Vision language model adept at comprehending text and visual inputs to produce informative responses.
mistralai/mistral-7b-instruct-v0.2	32k tokens	This LLM follows instructions, completes requests, and generates creative text.
mistralai/mixtral-8x7b-instruct-v0.1	8,192 tokens	An MOE LLM that follows instructions, completes requests, and generates creative text.
mistralai/mistral-large	4,096 tokens	Creates diverse synthetic data that mimics the characteristics of real-world data.
mistralai/mixtral-8x22b-instruct-v0.1	8,192 tokens	Creates diverse synthetic data that mimics the characteristics of real-world data.
mistralai/mistral-7b-instruct-v0.3	32k tokens	This LLM follows instructions, completes requests, and generates creative text.
nv-mistralai/mistral-nemo-12b-instruct	128k tokens	Most advanced language model for reasoning, code, multilingual tasks; runs on a single GPU.
mistralai/mamba-codestral-7b-v0.1	256k tokens	Model for writing and interacting with code across a wide range of programming languages and tasks.
microsoft/phi-3-mini-128k-instruct	128K tokens	Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills.
microsoft/phi-3-mini-4k-instruct	4,096 tokens	Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills.
microsoft/phi-3-small-8k-instruct	8,192 tokens	Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills.
microsoft/phi-3-small-128k-instruct	128K tokens	Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills.
microsoft/phi-3-medium-4k-instruct	4,096 tokens	Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills.
microsoft/phi-3-medium-128k-instruct	128K tokens	Lightweight, state-of-the-art open LLM with strong math and logical reasoning skills.
microsoft/phi-3.5-mini-instruct	128K tokens	Lightweight multilingual LLM powering AI applications in latency bound, memory/compute constrained environments
microsoft/phi-3.5-moe-instruct	128K tokens	Advanced LLM based on Mixture of Experts architecture to deliver compute efficient content generation
microsoft/kosmos-2	1,024 tokens	Groundbreaking multimodal model designed to understand and reason about visual elements in images.
microsoft/phi-3-vision-128k-instruct	128k tokens	Cutting-edge open multimodal model exceling in high-quality reasoning from images.
microsoft/phi-3.5-vision-instruct	128k tokens	Cutting-edge open multimodal model exceling in high-quality reasoning from images.
databricks/dbrx-instruct	12k tokens	A general-purpose LLM with state-of-the-art performance in language understanding, coding, and RAG.
snowflake/arctic	1,024 tokens	Delivers high efficiency inference for enterprise applications focused on SQL generation and coding.
aisingapore/sea-lion-7b-instruct	4,096 tokens	LLM to represent and serve the linguistic and cultural diversity of Southeast Asia
ibm/granite-8b-code-instruct	4,096 tokens	Software programming LLM for code generation, completion, explanation, and multi-turn conversion.
ibm/granite-34b-code-instruct	8,192 tokens	Software programming LLM for code generation, completion, explanation, and multi-turn conversion.
ibm/granite-3.0-8b-instruct	4,096 tokens	Advanced Small Language Model supporting RAG, summarization, classification, code, and agentic AI
ibm/granite-3.0-3b-a800m-instruct	4,096 tokens	Highly efficient Mixture of Experts model for RAG, summarization, entity extraction, and classification
mediatek/breeze-7b-instruct	4,096 tokens	Creates diverse synthetic data that mimics the characteristics of real-world data.
upstage/solar-10.7b-instruct	4,096 tokens	Excels in NLP tasks, particularly in instruction-following, reasoning, and mathematics.
writer/palmyra-med-70b-32k	32k tokens	Leading LLM for accurate, contextually relevant responses in the medical domain.
writer/palmyra-med-70b	32k tokens	Leading LLM for accurate, contextually relevant responses in the medical domain.
writer/palmyra-fin-70b-32k	32k tokens	Specialized LLM for financial analysis, reporting, and data processing
01-ai/yi-large	32k tokens	Powerful model trained on English and Chinese for diverse tasks including chatbot and creative writing.
deepseek-ai/deepseek-coder-6.7b-instruct	2k tokens	Powerful coding model offering advanced capabilities in code generation, completion, and infilling
rakuten/rakutenai-7b-instruct	1,024 tokens	Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.
rakuten/rakutenai-7b-chat	1,024 tokens	Advanced state-of-the-art LLM with language understanding, superior reasoning, and text generation.
baichuan-inc/baichuan2-13b-chat	4,096 tokens	Support Chinese and English chat, coding, math, instruction following, solving quizzes

Local NVIDIA NIM Deployed using WSL2

NVIDIA NIM enables you to run powerful LLMs locally on your Windows machine using WSL2 (Windows Subsystem for Linux). This approach allows you to leverage your NVIDIA GPU for private, secure, and cost-effective AI inference without relying on cloud services. Perfect for development, testing, or production scenarios where data privacy or offline capabilities are required.

Here is a step-by-step guide to setting up a local NVIDIA NIM model:

Follow installation instructions from NVIDIA Website
Install the local model. For Llama 3.1-8b follow instructions
Configure your crewai local models:

Code

from crewai.llm import LLM

local_nvidia_nim_llm = LLM(
    model="openai/meta/llama-3.1-8b-instruct", # it's an openai-api compatible model
    base_url="http://localhost:8000/v1",
    api_key="<your_api_key|any text if you have not configured it>", # api_key is required, but you can use any text
)

# Then you can use it in your crew:

@CrewBase
class MyCrew():
    # ...

    @agent
    def researcher(self) -> Agent:
        return Agent(
            config=self.agents_config['researcher'], # type: ignore[index]
            llm=local_nvidia_nim_llm
        )

    # ...

Groq

Set the following environment variables in your .env file:

Code

GROQ_API_KEY=<your-api-key>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="groq/llama-3.2-90b-text-preview",
    temperature=0.7
)

Model	Context Window	Best For
Llama 3.1 70B/8B	131,072 tokens	High-performance, large context tasks
Llama 3.2 Series	8,192 tokens	General-purpose tasks
Mixtral 8x7B	32,768 tokens	Balanced performance and context

IBM watsonx.ai

Set the following environment variables in your .env file:

Code

# Required
WATSONX_URL=<your-url>
WATSONX_APIKEY=<your-apikey>
WATSONX_PROJECT_ID=<your-project-id>

# Optional
WATSONX_TOKEN=<your-token>
WATSONX_DEPLOYMENT_SPACE_ID=<your-space-id>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="watsonx/meta-llama/llama-3-1-70b-instruct",
    base_url="https://api.watsonx.ai/v1"
)

Ollama (Local LLMs)

Fireworks AI

Set the following environment variables in your .env file:

Code

FIREWORKS_API_KEY=<your-api-key>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="fireworks_ai/accounts/fireworks/models/llama-v3-70b-instruct",
    temperature=0.7
)

Perplexity AI

Set the following environment variables in your .env file:

Code

PERPLEXITY_API_KEY=<your-api-key>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="llama-3.1-sonar-large-128k-online",
    base_url="https://api.perplexity.ai/"
)

Hugging Face

Set the following environment variables in your .env file:

Code

HF_TOKEN=<your-api-key>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct"
)

SambaNova

Set the following environment variables in your .env file:

Code

SAMBANOVA_API_KEY=<your-api-key>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="sambanova/Meta-Llama-3.1-8B-Instruct",
    temperature=0.7
)

Model	Context Window	Best For
Llama 3.1 70B/8B	Up to 131,072 tokens	High-performance, large context tasks
Llama 3.1 405B	8,192 tokens	High-performance and output quality
Llama 3.2 Series	8,192 tokens	General-purpose, multimodal tasks
Llama 3.3 70B	Up to 131,072 tokens	High-performance and output quality
Qwen2 familly	8,192 tokens	High-performance and output quality

Cerebras

Set the following environment variables in your .env file:

Code

# Required
CEREBRAS_API_KEY=<your-api-key>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="cerebras/llama3.1-70b",
    temperature=0.7,
    max_tokens=8192
)

Cerebras features:

Fast inference speeds
Competitive pricing
Good balance of speed and quality
Support for long context windows

Open Router

Set the following environment variables in your .env file:

Code

OPENROUTER_API_KEY=<your-api-key>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="openrouter/deepseek/deepseek-r1",
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY
)

Open Router models:

openrouter/deepseek/deepseek-r1
openrouter/deepseek/deepseek-chat

Nebius AI Studio

Set the following environment variables in your .env file:

Code

NEBIUS_API_KEY=<your-api-key>

Example usage in your CrewAI project:

Code

llm = LLM(
    model="nebius/Qwen/Qwen3-30B-A3B"
)

Nebius AI Studio features:

Large collection of open source models
Higher rate limits
Competitive pricing
Good balance of speed and quality

Streaming Responses

CrewAI supports streaming responses from LLMs, allowing your application to receive and process outputs in real-time as they’re generated.

Enable streaming by setting the stream parameter to True when initializing your LLM:

from crewai import LLM

# Create an LLM with streaming enabled
llm = LLM(
    model="openai/gpt-4o",
    stream=True  # Enable streaming
)

When streaming is enabled, responses are delivered in chunks as they’re generated, creating a more responsive user experience.

Enable streaming by setting the stream parameter to True when initializing your LLM:

from crewai import LLM

# Create an LLM with streaming enabled
llm = LLM(
    model="openai/gpt-4o",
    stream=True  # Enable streaming
)

When streaming is enabled, responses are delivered in chunks as they’re generated, creating a more responsive user experience.

CrewAI emits events for each chunk received during streaming:

from crewai.utilities.events import (
  LLMStreamChunkEvent
)
from crewai.utilities.events.base_event_listener import BaseEventListener

class MyCustomListener(BaseEventListener):
    def setup_listeners(self, crewai_event_bus):
        @crewai_event_bus.on(LLMStreamChunkEvent)
        def on_llm_stream_chunk(self, event: LLMStreamChunkEvent):
          # Process each chunk as it arrives
          print(f"Received chunk: {event.chunk}")

my_listener = MyCustomListener()

Click here for more details

All LLM events in CrewAI include agent and task information, allowing you to track and filter LLM interactions by specific agents or tasks:

from crewai import LLM, Agent, Task, Crew
from crewai.utilities.events import LLMStreamChunkEvent
from crewai.utilities.events.base_event_listener import BaseEventListener

class MyCustomListener(BaseEventListener):
    def setup_listeners(self, crewai_event_bus):
        @crewai_event_bus.on(LLMStreamChunkEvent)
        def on_llm_stream_chunk(source, event):
            if researcher.id == event.agent_id:
                print("\n==============\n Got event:", event, "\n==============\n")


my_listener = MyCustomListener()

llm = LLM(model="gpt-4o-mini", temperature=0, stream=True)

researcher = Agent(
    role="About User",
    goal="You know everything about the user.",
    backstory="""You are a master at understanding people and their preferences.""",
    llm=llm,
)

search = Task(
    description="Answer the following questions about the user: {question}",
    expected_output="An answer to the question.",
    agent=researcher,
)

crew = Crew(agents=[researcher], tasks=[search])

result = crew.kickoff(
    inputs={"question": "..."}
)

This feature is particularly useful for:

Debugging specific agent behaviors
Logging LLM usage by task type
Auditing which agents are making what types of LLM calls
Performance monitoring of specific tasks

Structured LLM Calls

CrewAI supports structured responses from LLM calls by allowing you to define a response_format using a Pydantic model. This enables the framework to automatically parse and validate the output, making it easier to integrate the response into your application without manual post-processing.

For example, you can define a Pydantic model to represent the expected response structure and pass it as the response_format when instantiating the LLM. The model will then be used to convert the LLM output into a structured Python object.

Code

from crewai import LLM

class Dog(BaseModel):
    name: str
    age: int
    breed: str


llm = LLM(model="gpt-4o", response_format=Dog)

response = llm.call(
    "Analyze the following messages and return the name, age, and breed. "
    "Meet Kona! She is 3 years old and is a black german shepherd."
)
print(response)

# Output:
# Dog(name='Kona', age=3, breed='black german shepherd')

Advanced Features and Optimization

Learn how to get the most out of your LLM configuration:

Context Window Management

CrewAI includes smart context management features:

from crewai import LLM

# CrewAI automatically handles:
# 1. Token counting and tracking
# 2. Content summarization when needed
# 3. Task splitting for large contexts

llm = LLM(
    model="gpt-4",
    max_tokens=4000,  # Limit response length
)

Best practices for context management:

Choose models with appropriate context windows
Pre-process long inputs when possible
Use chunking for large documents
Monitor token usage to optimize costs

Performance Optimization

Token Usage Optimization

Choose the right context window for your task:

Small tasks (up to 4K tokens): Standard models
Medium tasks (between 4K-32K): Enhanced models
Large tasks (over 32K): Large context models

# Configure model with appropriate settings
llm = LLM(
    model="openai/gpt-4-turbo-preview",
    temperature=0.7,    # Adjust based on task
    max_tokens=4096,    # Set based on output needs
    timeout=300        # Longer timeout for complex tasks
)

Lower temperature (0.1 to 0.3) for factual responses
Higher temperature (0.7 to 0.9) for creative tasks

Best Practices

Monitor token usage
Implement rate limiting
Use caching when possible
Set appropriate max_tokens limits

Remember to regularly monitor your token usage and adjust your configuration as needed to optimize costs and performance.

Drop Additional Parameters

CrewAI internally uses Litellm for LLM calls, which allows you to drop additional parameters that are not needed for your specific use case. This can help simplify your code and reduce the complexity of your LLM configuration. For example, if you don’t need to send the stop parameter, you can simply omit it from your LLM call:

from crewai import LLM
import os

os.environ["OPENAI_API_KEY"] = "<api-key>"

o3_llm = LLM(
    model="o3",
    drop_params=True,
    additional_drop_params=["stop"]
)

Common Issues and Solutions

Most authentication issues can be resolved by checking API key format and environment variable names.

# OpenAI
OPENAI_API_KEY=sk-...

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...

Most authentication issues can be resolved by checking API key format and environment variable names.

# OpenAI
OPENAI_API_KEY=sk-...

# Anthropic
ANTHROPIC_API_KEY=sk-ant-...

Always include the provider prefix in model names

# Correct
llm = LLM(model="openai/gpt-4")

# Incorrect
llm = LLM(model="gpt-4")

Use larger context models for extensive tasks

# Large context model
llm = LLM(model="openai/gpt-4o")  # 128K tokens

Get Started

Guides

Core Concepts

MCP Integration

Tools

Observability

Learn

Telemetry

Overview

What are LLMs?

LLM Basics

Context Window

Temperature

Provider Selection

Setting up your LLM

Provider Configuration Examples

Gemini models

Gemma

Streaming Responses

Structured LLM Calls

Advanced Features and Optimization

Common Issues and Solutions

Get Started

Guides

Core Concepts

MCP Integration

Tools

Observability

Learn

Telemetry

​Overview

​What are LLMs?

LLM Basics

Context Window

Temperature

Provider Selection

​Setting up your LLM

​Provider Configuration Examples

​Streaming Responses

​Structured LLM Calls

​Advanced Features and Optimization

​Common Issues and Solutions

Overview

What are LLMs?

Setting up your LLM

Provider Configuration Examples

Streaming Responses

Structured LLM Calls

Advanced Features and Optimization

Common Issues and Solutions